
AI & UXR

Understanding UX AI Benchmarks: What HLE and METR Really Tell Us About AI Tools


6 min read · Mar 26, 2026

📌 Key Takeaways

  • HLE (Humanity’s Last Exam) tests academic expertise—it’s not a practical test for UX work.

  • METR measures how autonomously AI systems can operate—relevant for product design, not for tool selection.

  • Models’ HLE scores rose from under 10% to nearly 50% in one year—benchmarks become obsolete rapidly.

  • A METR study showed: Developers were 19% slower with AI—but believed they were faster.

  • Benchmark scores systematically overestimate the everyday performance of AI tools.

  • For UX work, other criteria matter: context fidelity, iteration capability, hallucination rate.

  • The most useful benchmark for you is your own workflow—tested with real tasks.


Introduction

Last week, an AI model broke another record on one of the toughest AI tests. And maybe you caught a glimpse of it—as a LinkedIn post, as a tech newsletter snippet—and then asked yourself: What does this actually mean for my work?


Good question. The answer is usually: less than you think.


In this article, I explain what lies behind two of the most talked-about AI benchmarks today—Humanity’s Last Exam (HLE) and METR—and why, while their results are fascinating, they have limited relevance for everyday UX work. You’ll also learn which criteria are truly relevant if you want to use AI tools for UX research or UX AI prompting.


I’ve been working in the UX field since 1999 and have been observing how AI is transforming the industry for several years now. What strikes me again and again is this: the discussion about AI capabilities often takes place in a language that has little to do with the day-to-day reality of UX professionals. I’d like to change that here.


What are HLE and METR—and why should you know about them?

Before we put the numbers into context, we need clarity on the terms. HLE and METR measure very different things—and both are only indirectly related to what UX professionals do on a daily basis.


HLE—the toughest knowledge test there is

Humanity’s Last Exam (HLE) is a benchmark developed jointly by the Center for AI Safety and Scale AI. It consists of 2,500 expert-level questions—covering mathematics, physics, chemistry, biology, computer science, and the humanities. The questions were contributed by researchers and doctoral students from over 500 institutions worldwide.


The name says it all: HLE is intended to be the last academic benchmark of its kind—because previous tests like MMLU are now solved by AI models with over 90% accuracy and are therefore hardly meaningful anymore.


What makes HLE special: The questions are phrased in such a way that you can’t simply Google them. A model must demonstrate genuine reasoning, not pattern recognition. The answers are unambiguous and automatically evaluable—either correct or incorrect.


An example of the difficulty level (hypothetical scenario): “How many paired tendons are supported by a specific sesamoid bone in the tail musculature of hummingbirds?” This is not a question a model knows from training. It must reason.


METR: not knowledge, but autonomous action

METR (Model Evaluation & Threat Research) is a non-profit organization in Berkeley that measures something fundamentally different: not what a model knows, but what it can do on its own.


The key metric is the so-called “Time Horizon”—the task duration at which an AI agent has a 50% probability of successfully completing a task. This is measured using real-world software tasks that typically take humans minutes to hours to complete [METR, 2025].


The focus is on safety questions: Can a model autonomously acquire resources? Can it replicate itself? Can it perform tasks over many hours without human supervision? This is relevant for AI safety research and for companies that want to deploy autonomous AI agents.


How do current models perform—and what do these numbers really mean?

The numbers are impressive—but should be taken with a grain of salt.


HLE: From zero to nearly fifty percent in one year

When HLE was published in January 2025, all tested models scored below 10%: GPT-4o achieved 3.3%, Claude 3.5 Sonnet 4.3%, and the then-leading model o1 around 9% [Scale AI, 2025].


As of March 2026, Gemini 3.1 Pro Preview leads the leaderboard with around 45%. That is an impressive leap—and at the same time a warning sign: benchmarks that were considered insurmountable are being saturated faster than expected.


The creators themselves considered it realistic that models would pass the 50% mark before the end of 2025 [Center for AI Safety / Scale AI, 2025]. That sounds like a milestone. But the HLE team itself warns: high accuracy on HLE would not demonstrate autonomous research capability or “artificial general intelligence.” The test measures structured academic problems, not open-ended creativity or research.


Furthermore, right from the start, all models exhibited systematically high calibration errors. This means the models were very confident in their answers, even when they were wrong. An independent study by FutureHouse (July 2025) also suggested that around 30% of HLE’s official reference answers in chemistry and biology could themselves be incorrect [FutureHouse, 2025]. The test itself, therefore, has quality issues.


METR: Increasingly Rapid Autonomy—But Still Far From Critical Thresholds

METR does not measure percentages, but time spans. As of February 2026, the best model (Claude Opus 4.6) achieves a 50% time horizon of just under 14.5 hours [METR, 2026]. This means: For tasks that take humans about 14.5 hours, the model succeeds in solving them in half of the cases.


According to METR, the doubling time for this value is about seven months—an exponential trend that should be taken seriously [METR, 2025].
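To make that doubling rate tangible, here is a purely illustrative back-of-the-envelope extrapolation using the figures quoted above. It is a sketch of the trend METR describes, not a METR forecast; real progress rarely follows a clean exponential.

```python
# Illustrative only: extrapolate the ~14.5 h time horizon (Feb 2026)
# with the ~7-month doubling time METR reports. Not a forecast.
start_horizon_hours = 14.5
doubling_time_months = 7

for months_ahead in (0, 7, 14, 21):
    horizon = start_horizon_hours * 2 ** (months_ahead / doubling_time_months)
    print(f"+{months_ahead:2d} months: ~{horizon:.1f} h time horizon")

# Prints:
# + 0 months: ~14.5 h time horizon
# + 7 months: ~29.0 h time horizon
# +14 months: ~58.0 h time horizon
# +21 months: ~116.0 h time horizon
```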


Regarding safety, the current situation is as follows: none of the tested models demonstrate sufficient capabilities for autonomous self-replication or for taking over critical systems [METR, 2024/2025]. But the curve is clearly pointing upward.


Benchmark | What is measured | Current peak value | Directly relevant for UX?
HLE | Academic expert knowledge | ~45% (Gemini 3.1 Pro, March 2026) | Hardly
METR Time Horizon | Autonomous action over time | ~14.5 h (Claude Opus 4.6, Feb. 2026) | Indirectly


Why Benchmark Scores Can Mislead You as a UX Professional

This is the real crux of the matter. And I’m not saying this to downplay benchmarks—but because in my consulting work, I repeatedly see decisions about AI tool selection being made based on leaderboard positions. That’s roughly like hiring a surgeon because they solve crossword puzzles faster than others.


The measurement problem: Closed-ended questions vs. open contexts

HLE exclusively tests questions with a clear, verifiable answer. That’s methodologically sound—and appropriate for benchmarking purposes. But UX work operates differently.


If you ask a model to synthesize twenty user interviews and identify the key areas of tension, there is no “right” answer. If you need UX writing variations for an error message, what counts is tone, empathy, and brevity—not academic correctness. When you use a model as a sparring partner for test concepts, it needs the ability to recognize contradictions in context—not physics formulas.


METR itself reaches a sobering conclusion in its productivity study: benchmarks overestimate model performance because they only measure well-defined, algorithmically evaluable tasks [METR, 2025]. In reality, the requirements are more complex, the quality standards more implicit, and the context more extensive.


The Perception Trap: When “Feeling Good” Doesn’t Mean “Being Good”

In 2025, METR published a randomized controlled study on the productivity of experienced developers using AI support. The result was surprising: Participants estimated that AI would make them about 24% faster. After completing the tasks, they believed they had been about 20% faster. In fact, they were, on average, 19% slower [METR, 2025].


This isn’t a developer problem. It’s a perception problem—and it’s deeply UX. We know from research: People often evaluate processes based on the subjective experience, not the result. AI tools often feel productive because they respond quickly, because the interface is fluid, because the result seems “good enough.” This masks the actual quality.


For UX professionals, this is a doubly relevant finding: first, for their own use of tools, and second, for the design of AI products. Anyone integrating an AI feature into a product should keep this very gap between perceived and actual performance in mind.


What UX professionals should measure instead

If HLE and METR are of little use in everyday UX work—what then? Here are the dimensions that really matter in my work:


For UX research support:

Criterion | Why it matters
Summary quality | Does the model condense interview transcripts meaningfully without losing nuances?
Context fidelity | Does it stick to what was said—or does it invent plausible-sounding additions?
Nuance recognition | Does it understand ambivalence, contradictions, and emotional undertones?
Iteration capability | Can it respond to feedback and refine results in a targeted manner?
Hallucination rate | How often does it invent facts—and how noticeable is that?

For UX writing and content:

Criterion | Why it matters
Tone | Does it reliably capture different moods and brand voices?
Brevity vs. clarity | Does it write concise microcopy without losing information?
Consistency | Does it maintain stylistic consistency across longer texts?


For AI product design:

Criterion | Why it matters
Error handling | How does the system react to unclear or contradictory inputs?
Transparency | Does it honestly communicate uncertainty—or does it feign certainty?
Expectation management | Does it set realistic user expectations, or does it invite disappointment?


How to benchmark AI tools for your UX work yourself

The most useful benchmark is the one you develop yourself. That sounds more complicated than it is.


  1. Collect 5–10 typical prompts from your daily work. No contrived examples—real tasks you regularly assign. Interview summaries, persona drafts, test scenarios, UX copy variations.

  2. Choose one quality criterion per task. Not “Is it good?”, but: Is the summary complete without making things up? Does the tone match the brand? Are the user quotes correctly attributed?

  3. Test the same prompt on two to three models and document the results. Not just once—but as a regular process, because models are constantly being updated. (A small script sketch after this list shows how this step can be partly automated.)

  4. Pay attention to Chatbot Arena results (LMArena). There, real users blindly evaluate which model responds better. This isn’t a perfect benchmark—but it’s closer to everyday reality than academic tests.

  5. Document your experiences qualitatively. When was the result surprisingly good? When did the model make something up? When did the iteration fail? These observations are more valuable than any percentage on a leaderboard.
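If you want to take some of the manual effort out of step 3, a few lines of Python are enough. The sketch below is a minimal example, not a finished tool: the ask_model function and the model names are placeholders for whatever API client or chat interface you actually use, and judging the outputs against your criterion from step 2 remains a human task.

```python
import csv
from datetime import date

# Minimal sketch of a personal prompt benchmark (steps 1-3 above).
# ask_model() is a stub - replace it with a call to your provider's API client.
def ask_model(model_name: str, prompt: str) -> str:
    return f"[stub output from {model_name} for prompt: {prompt[:40]}...]"

# Step 1: real prompts from your daily work (shortened placeholders here).
prompts = {
    "interview_summary": "Summarize the attached interview transcript ...",
    "error_message_copy": "Write three variants of an error message for ...",
}

# Step 3: run every prompt against the models you want to compare and log
# the raw outputs, so you can judge them against your criterion from step 2.
models = ["model-a", "model-b", "model-c"]  # placeholder names

with open(f"prompt_benchmark_{date.today()}.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["date", "task", "model", "output"])
    for task, prompt in prompts.items():
        for model in models:
            writer.writerow([date.today(), task, model, ask_model(model, prompt)])
```

Re-running the same script after model updates gives you the regular check from step 3 almost for free; the qualitative notes from step 5 still belong in your own words.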


FAQ: Common Questions About AI Benchmarks and UX

Should I prefer models with high HLE scores? 

Not automatically. HLE measures academic expert knowledge under controlled conditions. For UX tasks like interview synthesis, UX writing, or test design, other skills are crucial. A model with a moderate HLE score may be better suited for your specific tasks than the current frontrunner.


What does the METR Time Horizon mean for me as a UX professional? 

Not much directly—but it’s relevant indirectly when you integrate AI agents or autonomous features into products. The Time Horizon indicates how long a task can get before the model’s chance of completing it on its own drops to 50%. This influences how much human oversight you need to plan for.


How can I tell if an AI tool is hallucinating for my UX work? 

Hallucinations are often only detectable if you know the original input. For interview summaries: compare two or three specific statements against the transcript. For research assistance: ask the model for its source—and verify it. No model is free of hallucinations, but the rate and detectability vary greatly.
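For the interview-summary case, part of this spot check can be scripted. The snippet below is a rough sketch under simple assumptions: the transcript is a plain-text file (the filename and quotes are placeholders), and a quote counts as “found” if something very similar appears in the transcript. It only flags quotes that don’t appear near-verbatim; paraphrases and misattributed speakers still need a human read.

```python
from difflib import SequenceMatcher

# Rough spot check: does a quote attributed in an AI summary appear
# (near-verbatim) in the interview transcript? Fuzzy matching via difflib.
def quote_in_transcript(quote: str, transcript: str, threshold: float = 0.85) -> bool:
    quote, transcript = quote.lower().strip(), transcript.lower()
    window, best = len(quote), 0.0
    step = max(1, window // 4)
    # Slide a quote-sized window across the transcript and keep the best match.
    for start in range(0, max(1, len(transcript) - window + 1), step):
        chunk = transcript[start:start + window]
        best = max(best, SequenceMatcher(None, quote, chunk).ratio())
    return best >= threshold

# Placeholder filename and example quotes - replace with your own material.
with open("interview_07_transcript.txt", encoding="utf-8") as f:
    transcript = f.read()

for quote in ["I never use the export feature", "the onboarding felt endless"]:
    found = quote_in_transcript(quote, transcript)
    print(f"{quote!r}: {'found' if found else 'NOT found - check manually'}")
```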


How often should I review my choice of tools? 

At least quarterly. Models are constantly updated—sometimes certain capabilities improve, sometimes they deteriorate after an update. Anyone who chooses a tool once and never questions it again may be working with outdated assumptions.


Are there benchmarks that are closer to real-world UX practice? 

Yes. Chatbot Arena (LMArena) and MT-Bench measure conversational quality and instruction following—which is more relevant to UX tasks than academic tests. Neither is perfect, but they’re a better starting point than HLE scores for tool selection.


Conclusion

HLE and METR are important tools—for AI researchers, security experts, and regulators. For UX professionals who work with AI tools daily or design AI features for users, they say little about what really matters.


The key point: Benchmarks measure what is measurable—not what is important. And in UX work, the most important things are often hard to quantify: a model’s ability to stay in open dialogue, recognize nuances, deal honestly with uncertainty, and respond to feedback.


My advice: Look at benchmark results to get a rough idea. But build your own mini-benchmarks using real tasks from your daily work. And test them regularly—because models change faster than leaderboards are updated.


What’s been your best or worst experience with an AI tool in UX work so far? I’m curious.


About the Author

Tara Bosenick is a UX consultant and co-owner of Uintent. Since 1999, she has been helping companies make their products more user-friendly—using sound research methods and a clear eye for what matters most. As a speaker at conferences such as Mensch & Computer and the World Usability Congress, she shares her knowledge of UX and AI. Her workshops on UX-AI prompting and AI integration embody what makes for good UX: clear benefits, direct applicability—and enjoyment of the process.


💌 Not enough yet? Then read on—in our newsletter.

Comes out four times a year. Sticks with you longer. https://www.uintent.com/de/newsletter


