Nearly two years ago, Microsoft CEO Satya Nadella predicted that artificial intelligence would fundamentally transform, and even replace, "knowledge work": the domain of white-collar professionals like lawyers, investment bankers, accountants, and IT specialists. Despite significant advances in foundation models, the impact on these roles has been slow to materialize. AI has shown real skill at in-depth research and planning, yet most white-collar work remains largely untouched.
This discrepancy has become one of the most intriguing puzzles in the AI landscape. Thanks to groundbreaking new research from Mercor, a prominent training-data provider, we are finally gaining crucial insights into why AI agents are not yet ready for prime time in the professional workplace.
New Benchmark Reveals AI's Workplace Struggles
Mercor's research introduces a new benchmark called Apex-Agents, designed to test how leading AI models perform on actual white-collar tasks drawn from consulting, investment banking, and law. The results were sobering: every AI lab tested received a failing grade. When confronted with queries from real professionals, even the most advanced models struggled, correctly answering fewer than a quarter of the questions. Most of the time, the models either gave an incorrect response or no answer at all.
Mercor CEO Brendan Foody, who worked on the paper, identified the models' primary weakness: their inability to track and synthesize information across multiple domains. This skill is critical for human knowledge workers, who routinely work across many different data sources and contexts.
"One of the big changes in this benchmark is that we built out the entire environment, modeled after real professional services," Foody told TechCrunch. "The way we do our jobs isn't with one individual giving us all the context in one place. In real life, you're operating across Slack and Google Drive and all these other tools."
For many agentic AI models, this kind of multi-domain reasoning remains inconsistent and challenging.
The scenarios used in the Apex-Agents benchmark were developed by actual professionals from Mercor's expert marketplace, who also established the criteria for successful responses. A review of the publicly available questions on Hugging Face quickly illustrates the complexity involved.
A Glimpse at the Challenge
Consider a question from the "Law" section:
During the first 48 minutes of the EU production outage, Northstar’s engineering team exported one or two bundled sets of EU production event logs containing personal data to the U.S. analytics vendor…Under Northstar’s own policies, it can reasonably treat the one or two log exports as consistent with Article 49?
The correct answer is "yes," but arriving at it requires a deep understanding of both the company's internal policies and pertinent EU privacy laws. Such a task could challenge even a well-informed human professional.
Foody emphasized the real-world relevance of the benchmark, stating, "I think this is probably the most important topic in the economy. The benchmark is very reflective of the real work that these people do." If an LLM could consistently answer these questions, it would indeed signify a major step towards automating many legal roles.
Apex-Agents vs. GDPval: A Deeper Dive into Professional Skills
OpenAI previously introduced its GDPval benchmark to assess professional skills, but Apex-Agents differs in important ways. While GDPval tests general knowledge across a broad range of professions, Apex-Agents evaluates an AI system's capacity to perform sustained, intricate tasks within a narrow set of high-value professions. This targeted approach makes the Apex-Agents test significantly harder for models, but it also gives a better indication of whether these specialized jobs can be automated.
Current Performance and Future Outlook
While no AI model proved capable of taking over as an investment banker, some performed better than others. Gemini 3 Flash led the group with a 24% one-shot accuracy, closely followed by GPT-5.2 at 23%. Below them, Opus 4.5, Gemini 3 Pro, and GPT-5 all scored approximately 18%.
Despite these initial shortcomings, the AI field has a track record of rapidly overcoming challenging benchmarks. With the Apex-Agents test now public, it serves as an open invitation for AI labs to demonstrate their capabilities, a challenge Foody fully expects them to meet in the coming months.
"It's improving really quickly," Foody noted. "Right now it's fair to say it's like an intern that gets it right a quarter of the time, but last year it was the intern that gets it right five or ten percent of the time. That kind of improvement year after year can have an impact so quickly."