---
title: "The Reality of AI in Consulting: A Mercor Perspective"
description: "The question of whether artificial intelligence can truly perform professional consulting work has moved from academic curiosity to practical urgency. Businesses, prospects, and even consulting firms..."
url: https://strategycase.com/the-reality-of-ai-in-consulting-a-mercor-perspective/
date: 2026-02-08
modified: 2026-02-10
author: "Florian Smeritschnig"
image: https://strategycase.com/wp-content/uploads/2026/02/AI-in-consulting-benchmark-scaled.jpg
categories: ["AI in Consulting", "Consulting Applications", "Consulting Career"]
type: post
lang: en
---

# The Reality of AI in Consulting: A Mercor Perspective

The question of whether artificial intelligence can truly perform professional consulting work has moved from academic curiosity to practical urgency. Businesses, prospects, and even consulting firms themselves are experimenting with autonomous AI agents in hopes they might eventually replace or augment human consultants. But recent research from **Mercor**, a research and expert-marketplace startup, suggests that today’s AI agents are nowhere near ready to take over the core analytical and decision-making work that drives consulting value.

## **A New Way of Evaluating AI in a Professional Consulting Context**

Mercor’s new benchmark, called **APEX-Agents**, was intentionally built to fill a gap left by traditional AI evaluations. Instead of testing isolated question-and-answer ability, APEX-Agents simulates *real professional workflows*: long-horizon, multi-step tasks that professionals actually face when delivering client work. The researchers behind the project surveyed hundreds of experts from top firms, then constructed data-rich “work environments” with documents, spreadsheets, emails, and tool interactions that an agent must navigate to complete meaningful tasks.

In one simulated consulting world, for instance, an AI agent might be tasked with navigating a file system filled with client reports, analyzing consumption patterns, computing penetration metrics using spreadsheets, and then drafting a concise summary of insights. These are not multiple-choice quizzes; they are complex missions that professionals expect to spend hours digesting and executing.

## **The Results are Out**

When Mercor tested a range of leading AI agents on these scenarios, the results were sobering. Even the best-performing models completed fewer than **25 percent of tasks correctly on the first try (Pass@1)**. According to Mercor’s Pass@1 leaderboard, Google’s *Gemini 3 Flash* and OpenAI’s *GPT-5.2* performed best, with success rates of roughly 24 percent and 23 percent respectively. Other models lagged behind.

![Mercor AI Benchmark](https://strategycase.com/wp-content/uploads/2026/02/image-1024x571.png)
*Mercor AI Benchmark (Source: Mercor)*

What do these numbers mean in practice?

A Pass@1 score reflects whether an agent can finish a complex task *correctly on its first attempt*. That is, to pass an APEX-Agents task, the output must meet all the rubric criteria defined by consulting experts. A score in the low 20s suggests that even state-of-the-art systems fail the vast majority of real professional tasks.
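The all-or-nothing scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not Mercor’s evaluation code: the `Attempt` class, the function names, and the sample data are all hypothetical, and the only thing taken from the article is the rule that a task passes when every rubric criterion is met.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """One agent attempt at one task (hypothetical structure)."""
    task_id: str
    criteria_met: list[bool]  # one flag per rubric criterion


def passed(attempt: Attempt) -> bool:
    # A task passes only if every rubric criterion is met.
    # An empty rubric is treated as a failure (an assumption of this sketch).
    return bool(attempt.criteria_met) and all(attempt.criteria_met)


def pass_at_1(first_attempts: list[Attempt]) -> float:
    """Share of tasks whose first attempt met all rubric criteria."""
    if not first_attempts:
        raise ValueError("need at least one attempt")
    return sum(passed(a) for a in first_attempts) / len(first_attempts)


# Made-up data: four tasks, one first attempt each.
attempts = [
    Attempt("task-a", [True, True, True]),          # passes
    Attempt("task-b", [True, False, True]),         # one miss fails the task
    Attempt("task-c", [True, True]),                # passes
    Attempt("task-d", [False, False, False, True]), # fails
]

print(pass_at_1(attempts))  # 0.5
```

Note how `task-b` meets two of three criteria and still counts as a failure. Under this kind of scoring, a deliverable that is mostly right earns no partial credit.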

Mercor’s benchmark also finds that even when agents are allowed multiple tries, performance remains sub-par. With eight attempts on each task, the best agent’s success rate rises only to about 40 percent. Even with repeated retries, agents lack the reliability many business leaders assume they already have.
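To make the multiple-attempt figure concrete, here is a small Python sketch of one common way to score retries: a task counts as solved if at least one of the first *k* attempts passes. This definition is an assumption, and Mercor’s paper may compute the metric differently. The function name `pass_at_k` and the data are made up for illustration.

```python
def pass_at_k(results: dict[str, list[bool]], k: int) -> float:
    """Share of tasks where at least one of the first k attempts passed.

    `results` maps a task id to its ordered pass/fail outcomes
    (a hypothetical structure for this sketch).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not results:
        raise ValueError("need at least one task")
    solved = sum(any(outcomes[:k]) for outcomes in results.values())
    return solved / len(results)


# Made-up outcomes: five tasks, eight attempts each.
results = {
    "task-a": [True, False, True, True, False, True, True, False],
    "task-b": [False, False, False, False, True, False, False, False],
    "task-c": [False] * 8,
    "task-d": [False] * 8,
    "task-e": [False] * 8,
}

print(pass_at_k(results, k=1))  # 0.2
print(pass_at_k(results, k=8))  # 0.4
```

In this toy data, retries rescue only `task-b`. The three tasks the agent never solves are unaffected by extra attempts, which is why the rate gains little from more tries.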

This is not because generative AI lacks intelligence, nor because it cannot produce formatted slides or drafts of analytical text. It is because *consulting thinking is fundamentally an end-to-end process*: it involves defining the right questions, prioritizing where to look for information, navigating fragmented knowledge across domains, and synthesizing coherent recommendations that hold up in an ambiguous context.

## **AI Models Still Fall Short**

Mercor’s research highlights precisely where agents fall short. They struggle to manage ambiguity, find the relevant facts buried across multiple files and formats, and maintain context over prolonged sequences of decisions – exactly the challenges consultants face every day.

Put simply: today’s AI can assist analysis, speed up drafts, and improve routine tasks, but it *does not yet replicate the core cognitive work of consulting professionals*.

For StrategyCase readers, this has implications on two fronts. First, it should temper over-optimistic narratives about [how AI is changing the junior consultant’s entry, training, and work](https://strategycase.com/artificial-intelligence-and-the-junior-consultant-how-entry-training-and-work-are-changing/). The ability to *execute complex, multi-step client tasks reliably and coherently* remains a distinctly human strength. Second, for candidates preparing for consulting interviews, [structured thinking matters more](https://strategycase.com/why-ai-assisted-case-prep-wont-save-you-if-you-lack-structure/) than polished outputs alone. Mercor intentionally modeled its tasks around real consulting workflows because that is ultimately what employers care about: *not just clean text or slides, but decisions that flow from structured thinking*. McKinsey has already introduced an [AI interview](https://strategycase.com/mckinsey-ai-interview/) that evaluates how candidates perform when working collaboratively with its in-house AI.

The APEX-Agents benchmark is also an open-source contribution, giving researchers and developers a high-fidelity yardstick for improving agent architectures and training methods. Mercor has released the entire dataset, evaluation infrastructure, gold outputs, and rubrics under a Creative Commons license, inviting the AI community to iterate on these tasks.

## **Challenges Ahead, for AI and Humans**

In the broader discussion about automation and knowledge work, Mercor’s results remind us of a fundamental truth: *AI today excels at parts of a job; humans still excel at the job itself*. Consultants integrate information across domains, wrestle with ambiguity, make judgment calls that blend experience with data, and deliver coherence at scale. Benchmarks like APEX-Agents show that while AI will increasingly augment human capability, the leap to full autonomy in professional consulting remains a substantial challenge.

In other words, AI will change [the typical tasks of entry-level consultants](https://strategycase.com/4-typical-tasks-of-entry-level-consultants/), but not necessarily *replace the role of the consultant* in shaping decisions.

For full details, see Mercor’s [**APEX-Agents summary**](https://www.mercor.com/blog/introducing-apex-agents?utm_source=chatgpt.com) and the underlying [**arXiv research paper**](https://arxiv.org/abs/2601.14242?utm_source=chatgpt.com), which rigorously document the benchmark design, evaluation rubrics, and model results.
