
AI & UXR

Understanding UX AI Benchmarks: What HLE and METR Really Tell Us About AI Tools


6 min read · Mar 26, 2026

📌 Key Takeaways

  • HLE (Humanity’s Last Exam) tests academic expertise—it’s not a practical test for UX work.

  • METR measures how autonomously AI systems can operate—relevant for product design, not for tool selection.

  • Models’ HLE scores rose from under 10% to nearly 50% in one year—benchmarks become obsolete rapidly.

  • A METR study showed: Developers were 19% slower with AI—but believed they were faster.

  • Benchmark scores systematically overestimate the everyday performance of AI tools.

  • For UX work, other criteria matter: context fidelity, iteration capability, hallucination rate.

  • The most useful benchmark for you is your own workflow—tested with real tasks.


Introduction

Last week, an AI model broke another record on one of the toughest AI tests. And maybe you caught a glimpse of it—as a LinkedIn post, as a tech newsletter snippet—and then asked yourself: What does this actually mean for my work?


Good question. The answer is usually: less than you think.


In this article, I explain what lies behind two of the most talked-about AI benchmarks today—Humanity’s Last Exam (HLE) and METR—and why, while their results are fascinating, they have limited relevance for everyday UX work. You’ll also learn which criteria are truly relevant if you want to use AI tools for UX research or UX AI prompting.


I’ve been working in the UX field since 1999 and have been observing how AI is transforming the industry for several years now. What strikes me again and again is this: the discussion about AI capabilities often takes place in a language that has little to do with the day-to-day reality of UX professionals. I’d like to change that here.


What are HLE and METR—and why should you know about them?

Before we put the numbers into context, we need clarity on the terms. HLE and METR measure very different things—and both are only indirectly related to what UX professionals do on a daily basis.


HLE—the toughest knowledge test there is

Humanity’s Last Exam (HLE) is a benchmark developed jointly by the Center for AI Safety and Scale AI. It consists of 2,500 expert-level questions—covering mathematics, physics, chemistry, biology, computer science, and the humanities. The questions were contributed by researchers and doctoral students from over 500 institutions worldwide.


The name says it all: HLE is intended to be the last academic benchmark of its kind—because previous tests like MMLU are now solved by AI models with over 90% accuracy and are therefore hardly meaningful anymore.


What makes HLE special: The questions are phrased in such a way that you can’t simply Google them. A model must demonstrate genuine reasoning, not pattern recognition. The answers are unambiguous and automatically evaluable—either correct or incorrect.


An example of the difficulty level (hypothetical scenario): “How many paired tendons are supported by a specific sesamoid bone in the tail musculature of hummingbirds?” This is not a question a model knows from training. It must reason.


METR: not knowledge, but autonomous action

METR (Model Evaluation & Threat Research) is a non-profit organization in Berkeley that measures something fundamentally different: not what a model knows, but what it can do on its own.


The key metric is the so-called “Time Horizon”—the task duration at which an AI agent has a 50% probability of successfully completing a task. This is measured using real-world software tasks that typically take humans minutes to hours to complete [METR, 2025].


The focus is on safety questions: Can a model autonomously acquire resources? Can it replicate itself? Can it perform tasks over many hours without human supervision? This is relevant for AI safety research and for companies that want to deploy autonomous AI agents.


How do current models perform—and what do these numbers really mean?

The numbers are impressive—but should be taken with a grain of salt.


HLE: From zero to nearly fifty percent in one year

When HLE was published in January 2025, all tested models scored below 10%: GPT-4o achieved 3.3%, Claude 3.5 Sonnet 4.3%, and the then-leading model o1 around 9% [Scale AI, 2025].


As of March 2026, Gemini 3.1 Pro Preview leads the leaderboard with around 45%. That is an impressive leap—and at the same time a warning sign: benchmarks that were considered insurmountable are being saturated faster than expected.


The creators themselves considered it realistic that models would pass the 50% mark before the end of 2025 [Center for AI Safety / Scale AI, 2025]. That sounds like a milestone. But the HLE team itself warns: high accuracy on HLE would not demonstrate autonomous research capability or “artificial general intelligence.” The test measures structured academic problems, not open-ended creativity or research.


Furthermore, right from the start, all models exhibited systematically high calibration errors. This means the models were very confident in their answers, even when they were wrong. An independent study by FutureHouse (July 2025) also suggested that around 30% of HLE’s official reference answers in chemistry and biology could themselves be incorrect [FutureHouse, 2025]. The test itself, therefore, has quality issues.


METR: Increasingly Rapid Autonomy—But Still Far From Critical Thresholds

METR does not measure percentages, but time spans. As of February 2026, the best model (Claude Opus 4.6) achieves a 50% time horizon of just under 14.5 hours [METR, 2026]. This means: For tasks that take humans about 14.5 hours, the model succeeds in solving them in half of the cases.


According to METR, the doubling time for this value is about seven months—an exponential trend that should be taken seriously [METR, 2025].
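To make that doubling rate tangible, here is a purely illustrative back-of-the-envelope extrapolation using the figures quoted above. It is a sketch of the trend METR describes, not a METR forecast; real progress rarely follows a clean exponential.

```python
# Illustrative only: extrapolate the ~14.5 h time horizon (Feb 2026)
# with the ~7-month doubling time METR reports. Not a forecast.
start_horizon_hours = 14.5
doubling_time_months = 7

for months_ahead in (0, 7, 14, 21):
    horizon = start_horizon_hours * 2 ** (months_ahead / doubling_time_months)
    print(f"+{months_ahead:2d} months: ~{horizon:.1f} h time horizon")

# Prints:
# + 0 months: ~14.5 h time horizon
# + 7 months: ~29.0 h time horizon
# +14 months: ~58.0 h time horizon
# +21 months: ~116.0 h time horizon
```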


Regarding safety, the current situation is as follows: none of the tested models demonstrate sufficient capabilities for autonomous self-replication or for taking over critical systems [METR, 2024/2025]. But the curve is clearly pointing upward.


Benchmark | What is measured | Current peak value | Directly relevant for UX?
HLE | Academic expert knowledge | ~45% (Gemini 3.1 Pro, March 2026) | Hardly
METR Time Horizon | Autonomous action over time | ~14.5 h (Claude Opus 4.6, Feb. 2026) | Indirectly


Why Benchmark Scores Can Mislead You as a UX Professional

This is the real crux of the matter. And I’m not saying this to downplay benchmarks—but because in my consulting work, I repeatedly see decisions about AI tool selection being made based on leaderboard positions. That’s roughly like hiring a surgeon because they solve crossword puzzles faster than others.


The measurement problem: Closed-ended questions vs. open contexts

HLE exclusively tests questions with a clear, verifiable answer. That’s methodologically sound—and appropriate for benchmarking purposes. But UX work operates differently.


If you ask a model to synthesize twenty user interviews and identify the key areas of tension, there is no “right” answer. If you need UX writing variations for an error message, what counts is tone, empathy, and brevity—not academic correctness. When you use a model as a sparring partner for test concepts, it needs the ability to recognize contradictions in context—not physics formulas.


METR itself reaches a sobering conclusion in its productivity study: benchmarks overestimate model performance because they only measure well-defined, algorithmically evaluable tasks [METR, 2025]. In reality, the requirements are more complex, the quality standards more implicit, and the context more extensive.


The Perception Trap: When “Feeling Good” Doesn’t Mean “Being Good”

In 2025, METR published a randomized controlled study on the productivity of experienced developers using AI support. The result was surprising: Participants estimated that AI would make them about 24% faster. After completing the tasks, they believed they had been about 20% faster. In fact, they were, on average, 19% slower [METR, 2025].


This isn’t a developer problem. It’s a perception problem—and it’s deeply UX. We know from research: People often evaluate processes based on the subjective experience, not the result. AI tools often feel productive because they respond quickly, because the interface is fluid, because the result seems “good enough.” This masks the actual quality.


For UX professionals, this is a doubly relevant finding: first, for their own use of tools, and second, for the design of AI products. Anyone integrating an AI feature into a product should keep this very gap between perceived and actual performance in mind.


What UX professionals should measure instead

If HLE and METR are of little use in everyday UX work—what then? Here are the dimensions that really matter in my work:


For UX research support:

Criterion | Why it matters
Summary quality | Does the model condense interview transcripts meaningfully without losing nuances?
Context fidelity | Does it stick to what was said—or does it invent plausible-sounding additions?
Nuance recognition | Does it understand ambivalence, contradictions, and emotional undertones?
Iteration capability | Can it respond to feedback and refine results in a targeted manner?
Hallucination rate | How often does it invent facts—and how noticeable is that?

For UX writing and content:

Criterion | Why it matters
Tone | Does it reliably capture different moods and brand voices?
Brevity vs. clarity | Does it write concise microcopy without losing information?
Consistency | Does it maintain stylistic consistency across longer texts?


For AI product design:

Criterion | Why it matters
Error handling | How does the system react to unclear or contradictory inputs?
Transparency | Does it honestly communicate uncertainty—or does it feign certainty?
Expectation management | Does it set realistic user expectations, or does it invite disappointment?


How to benchmark AI tools for your UX work yourself

The most useful benchmark is the one you develop yourself. That sounds more complicated than it is.


  1. Collect 5–10 typical prompts from your daily work. No contrived examples—real tasks you regularly assign. Interview summaries, persona drafts, test scenarios, UX copy variations.

  2. Choose one quality criterion per task. Not “Is it good?”, but: Is the summary complete without making things up? Does the tone match the brand? Are the user quotes correctly attributed?

  3. Test the same prompt on two to three models and document the results. Not just once—but as a regular process, because models are constantly being updated. (A small script sketch after this list shows how this step can be partly automated.)

  4. Pay attention to Chatbot Arena results (LMArena). There, real users blindly evaluate which model responds better. This isn’t a perfect benchmark—but it’s closer to everyday reality than academic tests.

  5. Document your experiences qualitatively. When was the result surprisingly good? When did the model make something up? When did the iteration fail? These observations are more valuable than any percentage on a leaderboard.
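If you want to take some of the manual effort out of step 3, a few lines of Python are enough. The sketch below is a minimal example, not a finished tool: the ask_model function and the model names are placeholders for whatever API client or chat interface you actually use, and judging the outputs against your criterion from step 2 remains a human task.

```python
import csv
from datetime import date

# Minimal sketch of a personal prompt benchmark (steps 1-3 above).
# ask_model() is a stub - replace it with a call to your provider's API client.
def ask_model(model_name: str, prompt: str) -> str:
    return f"[stub output from {model_name} for prompt: {prompt[:40]}...]"

# Step 1: real prompts from your daily work (shortened placeholders here).
prompts = {
    "interview_summary": "Summarize the attached interview transcript ...",
    "error_message_copy": "Write three variants of an error message for ...",
}

# Step 3: run every prompt against the models you want to compare and log
# the raw outputs, so you can judge them against your criterion from step 2.
models = ["model-a", "model-b", "model-c"]  # placeholder names

with open(f"prompt_benchmark_{date.today()}.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["date", "task", "model", "output"])
    for task, prompt in prompts.items():
        for model in models:
            writer.writerow([date.today(), task, model, ask_model(model, prompt)])
```

Re-running the same script after model updates gives you the regular check from step 3 almost for free; the qualitative notes from step 5 still belong in your own words.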


FAQ: Common Questions About AI Benchmarks and UX

Should I prefer models with high HLE scores? 

Not automatically. HLE measures academic expert knowledge under controlled conditions. For UX tasks like interview synthesis, UX writing, or test design, other skills are crucial. A model with a moderate HLE score may be better suited for your specific tasks than the current frontrunner.


What does the METR Time Horizon mean for me as a UX professional? 

Not much directly—but it’s relevant indirectly when you integrate AI agents or autonomous features into products. The Time Horizon indicates how long a task can get before the model’s chance of completing it on its own drops to 50%. This influences how much human oversight you need to plan for.


How can I tell if an AI tool is hallucinating for my UX work? 

Hallucinations are often only detectable if you know the original input. For interview summaries: compare two or three specific statements against the transcript. For research assistance: ask the model for its source—and verify it. No model is free of hallucinations, but the rate and detectability vary greatly.
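For the interview-summary case, part of this spot check can be scripted. The snippet below is a rough sketch under simple assumptions: the transcript is a plain-text file (the filename and quotes are placeholders), and a quote counts as “found” if something very similar appears in the transcript. It only flags quotes that don’t appear near-verbatim; paraphrases and misattributed speakers still need a human read.

```python
from difflib import SequenceMatcher

# Rough spot check: does a quote attributed in an AI summary appear
# (near-verbatim) in the interview transcript? Fuzzy matching via difflib.
def quote_in_transcript(quote: str, transcript: str, threshold: float = 0.85) -> bool:
    quote, transcript = quote.lower().strip(), transcript.lower()
    window, best = len(quote), 0.0
    step = max(1, window // 4)
    # Slide a quote-sized window across the transcript and keep the best match.
    for start in range(0, max(1, len(transcript) - window + 1), step):
        chunk = transcript[start:start + window]
        best = max(best, SequenceMatcher(None, quote, chunk).ratio())
    return best >= threshold

# Placeholder filename and example quotes - replace with your own material.
with open("interview_07_transcript.txt", encoding="utf-8") as f:
    transcript = f.read()

for quote in ["I never use the export feature", "the onboarding felt endless"]:
    found = quote_in_transcript(quote, transcript)
    print(f"{quote!r}: {'found' if found else 'NOT found - check manually'}")
```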


How often should I review my choice of tools? 

At least quarterly. Models are constantly updated—sometimes certain capabilities improve, sometimes they deteriorate after an update. Anyone who chooses a tool once and never questions it again may be working with outdated assumptions.


Are there benchmarks that are closer to real-world UX practice? 

Yes. Chatbot Arena (LMArena) and MT-Bench measure conversational quality and instruction following—which is more relevant to UX tasks than academic tests. Neither is perfect, but they’re a better starting point than HLE scores for tool selection.


Conclusion

HLE and METR are important tools—for AI researchers, security experts, and regulators. For UX professionals who work with AI tools daily or design AI features for users, they say little about what really matters.


The key point: Benchmarks measure what is measurable—not what is important. And in UX work, the most important things are often hard to quantify: a model’s ability to stay in open dialogue, recognize nuances, deal honestly with uncertainty, and respond to feedback.


My advice: Look at benchmark results to get a rough idea. But build your own mini-benchmarks using real tasks from your daily work. And test them regularly—because models change faster than leaderboards are updated.


What’s been your best or worst experience with an AI tool in UX work so far? I’m curious.


About the Author

Tara Bosenick is a UX consultant and co-owner of Uintent. Since 1999, she has been helping companies make their products more user-friendly—using sound research methods and a clear eye for what matters most. As a speaker at conferences such as Mensch & Computer and the World Usability Congress, she shares her knowledge of UX and AI. Her workshops on UX-AI prompting and AI integration embody what makes for good UX: clear benefits, direct applicability—and enjoyment of the process.


💌 Not enough yet? Then read on—in our newsletter.

Comes out four times a year. Sticks with you longer. https://www.uintent.com/de/newsletter


