AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents

University of California, Berkeley
*Corresponding author
AgentSynth Pipeline Overview

AgentSynth leverages information asymmetry to construct complex tasks from simple, solvable subtasks, enabling scalable synthesis of diverse computer-use agent datasets.

Abstract

We introduce AgentSynth, a scalable and cost-efficient pipeline for automatically synthesizing high-quality tasks and trajectory datasets for generalist computer-use agents. Leveraging information asymmetry, AgentSynth constructs subtasks that are simple during generation but significantly more challenging when composed into long-horizon tasks, enabling the creation of over 6,000 diverse and realistic tasks. Our pipeline begins with an LLM-based task proposer guided by a persona, followed by an execution agent that completes the task and logs the trajectory. This process is repeated iteratively to form a sequence of subtasks, which are then summarized by a separate agent into a composite task of controllable difficulty. A key strength of AgentSynth is its ability to precisely modulate task complexity by varying the number of subtasks. Empirical evaluations show that state-of-the-art LLM agents suffer a steep performance drop, from 18% success at difficulty level 1 to just 4% at level 6, highlighting the benchmark's difficulty and discriminative power. Moreover, our pipeline achieves a low average cost of $0.60 per trajectory, orders of magnitude cheaper than human annotations.

Task Generation Process

Given a persona, the task proposer generates an initial task, which is followed by a sequence of subtasks executed by the agent. Each step is verified; if execution fails, a revised subtask description is generated. After n successful steps, a summarization agent composes final high-level tasks. Tasks at different difficulty levels are formed by summarizing the first 1 to n subtasks, enabling controllable task complexity.
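The loop described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `Step`, `generate_subtasks`, and the four callables (`propose`, `execute`, `verify`, `revise`) are hypothetical names standing in for the LLM-based proposer, the execution agent, the verifier, and the description reviser. Re-executing a revised description up to `max_attempts` times is also an assumption about how failures are handled.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    description: str       # subtask description that was verified as completed
    trajectory: List[str]  # logged actions for this subtask (placeholder format)


def generate_subtasks(
    persona: str,
    propose: Callable[[str, List[str]], str],   # hypothetical task proposer
    execute: Callable[[str], List[str]],        # hypothetical execution agent
    verify: Callable[[str, List[str]], bool],   # hypothetical verifier
    revise: Callable[[str, List[str]], str],    # hypothetical description reviser
    n: int,
    max_attempts: int = 3,                      # assumed retry cap, not from the source
) -> List[Step]:
    """Collect n verified subtasks, revising a description whenever execution fails."""
    completed: List[Step] = []
    while len(completed) < n:
        # The proposer sees the persona and the subtasks completed so far.
        description = propose(persona, [s.description for s in completed])
        for _ in range(max_attempts):
            trajectory = execute(description)
            if verify(description, trajectory):
                completed.append(Step(description, trajectory))
                break
            # Execution failed: generate a revised subtask description and retry.
            description = revise(description, trajectory)
        else:
            raise RuntimeError(f"no verified execution for: {description!r}")
    return completed


# Toy stand-ins so the sketch runs without any model or desktop environment.
def demo_propose(persona: str, history: List[str]) -> str:
    k = len(history) + 1
    return f"flaky step {k}" if k == 2 else f"step {k}"


def demo_execute(description: str) -> List[str]:
    return [f"action for: {description}"]


def demo_verify(description: str, trajectory: List[str]) -> bool:
    return not description.startswith("flaky")


def demo_revise(description: str, trajectory: List[str]) -> str:
    return description.replace("flaky ", "revised ")


demo_steps = generate_subtasks(
    "a senior student", demo_propose, demo_execute, demo_verify, demo_revise, n=3
)
print([s.description for s in demo_steps])
```

In the toy run, the second proposed subtask fails verification once, is revised, and then succeeds, so three verified steps are returned.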

Task Generation Process Example

Case Study

To illustrate the quality and realism of tasks generated by the AgentSynth pipeline, we present a representative example. Additional examples can be found in Appendix B.

First, a persona is sampled from the persona hub:

Persona: a senior student at Kentucky Wesleyan College.

Then, the task proposer generated an initial task tailored to this persona:

Initial task: Search for the 'Kentucky Wesleyan College 2024 academic calendar' in Google Chrome.

Next, this initial task was successfully executed, and five follow-up tasks were iteratively generated and completed:

  • Follow-up Task 1: Find the Kentucky Wesleyan College 2024 commencement (graduation) date on the academic calendar currently open in Chrome.
  • Follow-up Task 2: Open the Calendar application after searching for graduation-related dates on an academic calendar website.
  • Follow-up Task 3: Scroll backwards month-by-month in the calendar application from March 2025 to June 2024 using the month view.
  • Follow-up Task 4: Create a new calendar event on the day of Kentucky Wesleyan College's 2024 commencement titled 'Graduation Day' and add a note: 'Remember to bring gown and arrive 1 hour early'.
  • Follow-up Task 5: Add a notification/reminder to the 'Graduation Day' event on May 3, 2024 in the Calendar app to alert you 1 day before.

Each of these subtasks is simple and logically follows from the previous one. The task summarizer composes them into coherent, high-level tasks. We define task difficulty level n as the summary of the first n subtasks, resulting in increasingly complex and realistic scenarios. These summarized tasks are then used for evaluation and benchmarking. The list below shows the final task descriptions at six difficulty levels, with italicized text indicating the incremental complexity introduced at each level:

  • Task Difficulty Level 1 (same as the initial task): Search for the 'Kentucky Wesleyan College 2024 academic calendar' in Google Chrome.
  • Task Difficulty Level 2: Find and report the date and time of the Kentucky Wesleyan College 2024 commencement ceremony by searching for the 2024 academic calendar online, locating the official calendar, and identifying the commencement event listed there.
  • Task Difficulty Level 3: Search for the 'Kentucky Wesleyan College 2024 academic calendar' in Google Chrome, find the 2024 commencement date, and then open the Calendar application to view or record the commencement date.
  • Task Difficulty Level 4: Find the Kentucky Wesleyan College 2024 commencement date using Google Chrome, then open the Calendar application and scroll back in month view from March 2025 to June 2024 in preparation for viewing or adding the graduation date to the calendar.
  • Task Difficulty Level 5: Find the Kentucky Wesleyan College 2024 commencement date by searching online using the academic calendar, and create a new event titled 'Graduation Day' in your digital Calendar application, adding a note that says 'Remember to bring gown and arrive 1 hour early'.
  • Task Difficulty Level 6: Find the Kentucky Wesleyan College 2024 commencement date by searching the 2024 academic calendar online, then create a calendar event titled 'Graduation Day' in the Calendar application with a note saying 'Remember to bring gown and arrive 1 hour early,' and set a reminder to alert you one day before the event.
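The definition of difficulty level n as the summary of the first n subtasks can be sketched as below. This is an illustrative sketch only: `build_difficulty_levels` and `join_summarizer` are hypothetical names, and the string-joining summarizer is a stand-in for the LLM-based summarization agent, which would instead write one coherent task description.

```python
from typing import Callable, Dict, List


def build_difficulty_levels(
    subtasks: List[str],
    summarize: Callable[[List[str]], str],  # hypothetical summarization agent
) -> Dict[int, str]:
    """Map each level n to a task description covering the first n subtasks."""
    levels: Dict[int, str] = {}
    for level in range(1, len(subtasks) + 1):
        prefix = subtasks[:level]
        # Level 1 is the initial task itself; higher levels are summaries.
        levels[level] = prefix[0] if level == 1 else summarize(prefix)
    return levels


def join_summarizer(prefix: List[str]) -> str:
    # Placeholder: a real summarizer would rewrite the steps as a single goal.
    return "; then ".join(prefix)


demo_subtasks = ["search calendar", "find date", "open app",
                 "scroll months", "create event", "add reminder"]
demo_levels = build_difficulty_levels(demo_subtasks, join_summarizer)
for level, description in demo_levels.items():
    print(level, description)
```

Because each level reuses a prefix of the same subtask sequence, one generated trajectory of n subtasks yields n tasks of increasing horizon.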

As the task level increases, both task length and complexity grow accordingly. Each additional subtask introduces new actions, tools, or planning steps. The accompanying figure shows the average token count across task levels, confirming that longer task compositions correspond to more elaborate task descriptions and execution requirements.
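A per-level average of description length could be computed along the following lines. This is a rough sketch: `average_token_count` is a hypothetical helper, whitespace splitting is an assumed stand-in for whatever tokenizer was actually used, and the example descriptions are made up for illustration rather than taken from the dataset.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def average_token_count(tasks: Iterable[Tuple[int, str]]) -> Dict[int, float]:
    """Average whitespace-token count of task descriptions, grouped by level."""
    counts_by_level: Dict[int, List[int]] = defaultdict(list)
    for level, description in tasks:
        # Assumption: whitespace tokens approximate the real token count.
        counts_by_level[level].append(len(description.split()))
    return {level: mean(counts) for level, counts in sorted(counts_by_level.items())}


# Made-up (level, description) pairs, for illustration only.
demo_tasks = [
    (1, "open chrome"),
    (1, "search the calendar"),
    (2, "search the calendar and report the date"),
]
demo_averages = average_token_count(demo_tasks)
print(demo_averages)
```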

We note that our notion of task difficulty corresponds primarily to task horizon and the compositional complexity of multiple subtasks. However, a task that is hard under this criterion may be hard for two distinct reasons: intrinsic complexity, or unfamiliarity to agents trained primarily on shorter or simpler tasks. Intrinsic complexity, such as the cognitive load required to manage multi-application workflows, maintain intermediate state, and recover from errors, often increases with task horizon. Shorter tasks may nevertheless be intrinsically challenging if they involve nuanced visual perception, context-dependent decisions, or unfamiliar interactions. Future analyses could systematically separate intrinsic task complexity from novelty or lack of exposure.

Data composition for AgentSynth

The diversity of topics and software covered by AgentSynth demonstrates its potential for training generalist computer-use agents.

Experiment results

Model performance across task difficulty levels (top) and across domain categories at difficulty level 1 (bottom) on the AgentSynth benchmark.

BibTeX

@article{xie2025agentsynth,
  title={AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents},
  author={Xie, Jingxu and Xu, Dylan and Zhao, Xuandong and Song, Dawn},
  journal={arXiv preprint arXiv:2506.14205},
  year={2025}
}