OpenAI GDPval: AI Masters Structure, Fails Text Reasoning

OpenAI has released a new benchmark, GDPval, revealing that top AI models are now achieving expert-level performance on a wide range of professional knowledge work. The initial results show models like the unreleased GPT-5 and Anthropic’s Claude Opus 4.1 are rated as equal to or better than human professionals in nearly half of the evaluated tasks. This marks a significant step in AI’s practical application in the workplace. However, OpenAI’s latest research on knowledge work also uncovers a critical nuance: this “expert” status is heavily dependent on the task format.
While the models demonstrate remarkable proficiency in generating structured outputs like PowerPoint presentations and Excel spreadsheets, their performance drops significantly on tasks requiring pure plain-text reasoning, where they still lag behind their human counterparts.
Key Points
- OpenAI’s GDPval benchmark evaluates frontier AI models on complex, real-world professional tasks.
- Top models were rated equal to or better than human experts in approximately 50% of gold-standard tasks.
- AI performance is highest on structured formats like PowerPoint, with win rates near 50%.
- The OpenAI GDPval benchmark highlights AI reasoning limitations, with win rates dropping below 25% for plain-text tasks.
Beyond Academic Metrics: The Economic Reality Test
The GDPval benchmark represents a deliberate shift from academic metrics to evaluations grounded in economic reality. Its architecture is designed to simulate the complex demands of high-value professional roles, providing a more accurate measure of an AI’s readiness for the modern workplace.
To achieve this, the benchmark includes 1,320 tasks that span 44 professions across nine major industries, each contributing over 5% to the U.S. GDP. The chosen roles are high-paying and consist of at least 60% non-physical tasks, according to an analysis based on U.S. Bureau of Labor Statistics data.
According to OpenAI, the tasks were created by industry professionals with an average of 14 years of experience, ensuring they reflect real-world complexity. This moves beyond simple text prompts to require complex deliverables, such as producing both a 3D model and a PowerPoint presentation for an engineering problem.
Evaluation is conducted through blind tests where industry experts compare the AI’s output against a human-created solution, scoring it as “better,” “as good as,” or “worse than.” OpenAI co-founder Greg Brockman described the benchmark as “an early step toward better methods for measuring and forecasting real-world model progress” (blockchain.news), signaling a strategic focus on tangible workplace impact.
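The blind-review protocol above reduces to a simple aggregation: each verdict is "better," "as good as," or "worse than," and a model's score is the share of tasks rated equal to or better than the human deliverable. The sketch below is illustrative only — the function name and the exact aggregation are assumptions, not OpenAI's published methodology.

```python
from collections import Counter

def win_rate(ratings):
    """Share of blind-review verdicts where the model's output was
    rated 'better' or 'as good as' the human expert's deliverable.
    (Illustrative; GDPval's exact scoring may differ.)"""
    counts = Counter(ratings)
    wins_or_ties = counts["better"] + counts["as good as"]
    return wins_or_ties / len(ratings)

# Hypothetical verdicts for one model across four tasks
ratings = ["better", "worse than", "as good as", "worse than"]
print(win_rate(ratings))  # 0.5
```

Under this reading, the "approximately 50%" headline figure means roughly half of all expert verdicts fell into the "better" or "as good as" buckets.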
Style Over Substance: The Format Paradox
While the headline results suggest AI is closing the gap with human experts, a deeper analysis of the performance data reveals a telling paradox. Performance on structured versus unstructured deliverables differs dramatically, indicating that presentation may be as important as substance in the current generation of models.
In the 220 gold-standard tasks, models like GPT-5 and Claude Opus 4.1 achieved near-human parity, with expert reviewers rating their outputs as equal to or better than the human benchmark in approximately half of the cases. GPT-5 also showed substantial generational gains over GPT-4o, with scores doubling or tripling on some metrics. The research also noted distinct strengths between the leading models, with Claude Opus 4.1 excelling in aesthetics and formatting while GPT-5 showed a lead in core expertise and accuracy. However, the most insightful finding relates to file formats.
On tasks that require plain-text submissions, where pure writing and reasoning are tested, the models struggled: GPT-5 achieved just a 22% win rate against the human benchmark, and Claude Opus 4.1 was even lower at 14%.
In stark contrast, the models’ performance surged when creating structured files. Claude Opus 4.1’s win rate jumped to 48% for PowerPoint presentations and 45% for Excel files (The Decoder). This disparity suggests that the models’ ability to generate well-formatted, visually organized outputs heavily influences their perceived quality, while their core reasoning abilities still require human oversight.
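The format gap described above is essentially a group-by over the same blind-review verdicts, split by deliverable type. The snippet below is a minimal sketch of that breakdown using invented sample data — the file-format labels and verdict strings are assumptions for illustration, not the benchmark's actual records.

```python
from collections import defaultdict

# Hypothetical (format, verdict) pairs from blind expert reviews
results = [
    ("pptx", "better"), ("pptx", "as good as"), ("pptx", "worse than"),
    ("txt", "worse than"), ("txt", "worse than"), ("txt", "as good as"),
]

def win_rates_by_format(results):
    """Win-or-tie rate per deliverable format: the fraction of verdicts
    rated 'better' or 'as good as' the human benchmark."""
    tallies = defaultdict(lambda: [0, 0])  # format -> [wins_or_ties, total]
    for fmt, verdict in results:
        tallies[fmt][1] += 1
        if verdict in ("better", "as good as"):
            tallies[fmt][0] += 1
    return {fmt: wins / total for fmt, (wins, total) in tallies.items()}

print(win_rates_by_format(results))
```

With this toy data the PowerPoint rate (2/3) dwarfs the plain-text rate (1/3), mirroring the 48%-versus-14% spread the reviewers reported.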
Speed Demons: The 100x First Draft Factory
The benchmark’s findings align with the current trajectory of enterprise AI, where models are being positioned as powerful assistants rather than autonomous replacements. GDPval evaluates models on isolated, “one-shot” tasks without any opportunity for feedback or iteration—a scenario that OpenAI acknowledges rarely reflects actual knowledge work.
This limitation highlights the emerging role of AI as a high-speed “first draft” generator. For the tasks themselves, the models are estimated to be 100 times faster and cheaper than human experts, making them ideal for accelerating the initial, often tedious, stages of a project. This frees up human professionals to focus on the critical, iterative work of refinement, strategic thinking, and client interaction.

This push for enterprise-ready AI is not happening in a vacuum. The global race for frontier models is intensifying, with competitors like Alibaba recently releasing its Qwen3 series, which demonstrates “near-frontier coding and agentic performance,” signaling that the development of expert-level AI is a worldwide phenomenon. This augmentation model is already being embedded into next-generation workflows. OpenAI’s preview of ChatGPT Pulse aims to create a proactive assistant that prepares personalized daily updates.
Similarly, Microsoft is expanding its 365 Copilot to integrate models from both OpenAI and Anthropic into its “Researcher” agent, designed for deep reasoning across enterprise data. These tools are built to handle structured components of a job, leaving the nuanced, collaborative work to humans.
The PowerPoint Paradox: Redefining Digital Expertise
OpenAI’s GDPval benchmark provides compelling evidence that frontier AI models are achieving a new level of capability in professional settings. The documented efficiency gains in speed and cost signal a significant development in productivity tools for knowledge workers. The results confirm these systems are becoming adept at producing high-quality initial outputs for a variety of complex tasks.
However, the data also brings a crucial distinction into focus: the models’ current “expertise” is heavily skewed toward structured presentation, while their unadorned reasoning still trails human ability. Recognizing these limitations, OpenAI plans to evolve the benchmark to include more realistic, interactive tasks and feedback loops in the future. As these powerful systems become more deeply integrated into our daily workflows, how will our definition of professional expertise need to evolve to account for this new human-machine partnership?