
CHAT GPT, HOW-TO, LLM, OPEN AI, PROMPTS, TOKEN, UX METHODS

Fictitious Quotes, Lost Nuances: The Hallucination Problem in Qualitative Analysis With LLMs

8 min read

Mar 5, 2026

Imagine you are presenting your stakeholders with a user statement from recent interviews. The citation format is correct, the content sounds plausible – but no one actually said that. The LLM reconstructed it.


This is not a science fiction scenario. It is an empirically proven problem that is quietly seeping into qualitative research practice.


In my work as a UX consultant, I have been observing for months how LLMs are increasingly being used to analyze interview transcripts. The promise is tempting: faster evaluation, less manual work, more interviews in less time. But how reliable are the results really? And what happens if we trust them too much?


In this article, we look at what current benchmarks—most notably the new HalluHard benchmark—show about the hallucination rate of LLMs, why multi-turn dialogues dramatically exacerbate the problem, and what this means specifically for UX research and UX AI prompting.


📌 The most important points in brief

  • Even the best LLMs hallucinate in realistic, multi-step scenarios in around 30% of cases – as shown by the new HalluHard benchmark.

  • Previous benchmarks (e.g., TruthfulQA, SimpleQA) are too simple and saturated – they do not reflect the real risks.

  • LLMs “regenerate” quotes from transcripts instead of copying them – sometimes with deviations that alter their meaning.

  • Omissions are the bigger problem: what an LLM leaves out is often more important for qualitative analysis than what it invents.

  • Multi-turn dialogues exacerbate everything: errors from early turns become the basis for later responses.

  • The solution lies not in more technology, but in greater methodological awareness—combined with smart prompt architecture.

  • Reading transcripts yourself remains irreplaceable. No prompt protects as well as your own familiarity with the data.


Why old benchmarks paint an overly optimistic picture

LLMs and hallucinations—the topic is not new. Benchmarks have long been used to measure the reliability of language models. But the crux of the matter is that most of these benchmarks test scenarios that have little to do with real-world use cases.


TruthfulQA (Lin et al., 2021) was long the standard: 817 short questions on common misconceptions, measured in a single-turn format. The problem is that it is now largely saturated. Analyses show that the benchmark is compromised by its inclusion in training data, and the multiple-choice variant can be “tricked” with simple heuristics to achieve almost 80% accuracy – without actually answering the question.


SimpleQA (OpenAI, 2024) was intended as a tougher alternative: 4,326 factual questions with clear, time-stable answers. But current frontier models with web search achieve almost 100% accuracy there – SimpleQA is also practically saturated.


The pattern is always the same: as soon as a benchmark is published, models optimize for it. And as soon as they optimize for it, it no longer reflects reality.

What is missing in all these tests is the complexity of real-world use cases. Open-ended answers. Multiple rounds of conversation. Niche knowledge. Sources that actually have to exist.


HalluHard: What happens when benchmarks become realistic?

This is exactly where HalluHard comes in – a new hallucination benchmark developed by researchers at EPFL and the ELLIS Institute / Max Planck Institute Tübingen.


The design is fundamentally different from previous benchmarks:

  • Multi-turn dialogues with 3 rounds of conversation instead of single-turn

  • 950 seed questions from four challenging domains: legal cases, research questions, medical guidelines, and coding

  • Two-stage verification: First, whether a cited reference actually exists (reference grounding). Second, whether the content actually supports the claim (content grounding) – including PDF parsing


The result is sobering: even the best available model configurations with web search hallucinate in around 30% of cases. By comparison, on SimpleQA with web search, the same models are close to zero.


Two further findings from HalluHard are particularly relevant for UX AI Research:

First: Error accumulation over turns. Models hallucinate significantly more often in later rounds of conversation because they build on their own previous errors. 3–20% of incorrect references reappear in later turns. The model essentially believes itself.


Second: The dangerous middle zone. When faced with completely unknown knowledge, models often abstain. When faced with niche knowledge—i.e., where they “somehow” know something but not enough—they fill in the gaps with plausible-sounding but incorrect details. This is precisely the zone where qualitative interview data often lies.


Why multi-turn dialogues particularly exacerbate the problem for UX research

This is where it gets concrete for us as UX researchers. A typical analysis workflow with LLMs is not a single turn. It looks more like this:


  1. “Here is the transcript of interview 3. What are the main topics?”

  2. “Compare that with the topics from interviews 1 and 2.”

  3. “What contradictions do you see between the participants?”

  4. “Create an overview of the key insights with quotes.”


This is a classic multi-turn dialogue – and exactly the setting in which HalluHard has demonstrated error accumulation.


If the model slightly misweights a topic in turn 1, this error becomes the premise for the comparison in turn 2. In turn 3, “contradictions” that may not even exist are constructed on the basis of an already distorted analysis. In turn 4, quotes are selected to support a narrative that has become entrenched – but is skewed.


The tricky thing is that with each turn, the result becomes more coherent and convincing. The model consistently continues to tell its own story – while the connection to the actual data becomes increasingly fragile.


Added to this is a methodological problem familiar from qualitative research: confirmation bias. If you suggest a particular direction in turn 2 – for example, “I believe frustration is a recurring theme” – the model will tend to confirm this hypothesis rather than question it. It has no built-in instinct to disagree.


Four specific problem areas in qualitative analysis

1. Invented quotes: regeneration instead of extraction

This is perhaps the most tangible problem. A study by the Learning Analytics Community, which automatically compared LLM-generated quotes with original transcripts, initially found that 7.7% of the quotes provided could not be found in the original [CEUR-WS.org]. Closer analysis revealed that many were not complete fabrications, but “regenerations” – the model had omitted filler words, changed punctuation, and slightly reworded sentences.


This may be tolerable for quantitative analyses. For qualitative research, however, it is a fundamental problem: an omitted “so,” a condensed sentence, a changed emphasis—these can make all the difference in interpretation. If you include a quote in a stakeholder presentation as an authentic user voice when it is actually an AI paraphrase, that is no small matter.
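
A check like the one used in that study is easy to approximate in your own workflow: compare every LLM-supplied quote against the transcript with a similarity measure rather than an exact match, so that lightly reworded passages are flagged instead of silently accepted. A minimal sketch in Python, using only the standard library (the 0.8 similarity threshold is an illustrative assumption, not a validated cut-off):

```python
import difflib
import re


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace so trivial edits don't mask a match."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", text.lower())).strip()


def classify_quote(quote: str, transcript: str, threshold: float = 0.8) -> str:
    """Return 'verbatim', 'regenerated' (close but altered), or 'not found'."""
    q, t = normalize(quote), normalize(transcript)
    if q in t:
        return "verbatim"
    # Slide a window of the quote's length over the transcript and keep the best similarity score.
    t_words, q_words = t.split(), q.split()
    best = 0.0
    for i in range(max(1, len(t_words) - len(q_words) + 1)):
        window = " ".join(t_words[i:i + len(q_words)])
        best = max(best, difflib.SequenceMatcher(None, q, window).ratio())
    return "regenerated" if best >= threshold else "not found"


# Example: a dropped word is flagged as a regeneration instead of being accepted as verbatim.
transcript = "Well, I mean, honestly the export just never worked for me."
print(classify_quote("honestly the export just never worked for me", transcript))  # verbatim
print(classify_quote("the export never worked for me", transcript))                # regenerated
print(classify_quote("the dashboard was confusing", transcript))                   # not found
```

Anything flagged as “regenerated” or “not found” goes back to the original transcript for a manual check before it appears anywhere near a report.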


2. Omissions: The invisible problem

Studies consistently show that LLMs omit more than they invent. An analysis of clinical transcript summaries (npj Digital Medicine, 2025) found a hallucination rate of 1.47%—but an omission rate of 3.45%. Omissions were more than twice as common as inventions.


This is particularly treacherous for UX research. It is often the subtle signals – a hesitation, a contradiction in the middle of an answer, a casual remark that identifies the real problem – that provide the most valuable insights. An LLM optimizes for coherence. It gives you what looks “typical” and filters out anything that deviates from that. This is the exact opposite of what good qualitative analysis should do.


3. Smoothing instead of depth

LLM outputs are linguistically polished, consistent, and professional. That sounds like a virtue – but it isn't always one. At first glance, an LLM-generated topic overview seems more convincing than handwritten notes full of question marks and arrows, yet it is often analytically shallower. The danger is that stakeholders place more trust in the result than it deserves – because it sounds professional, not because it is methodologically deep.


4. Cultural and linguistic nuances

A study of focus group transcripts from Kenya (Nature, Scientific Reports, 2025) showed that GPT-4o had considerable difficulty with idiomatic expressions and culturally embedded language. The hallucinations ranged from individual word changes to combinations of text passages that altered the meaning. In the UX context, this means that when participants use technical jargon, colloquial language, or vague expressions, the model translates this into “standard language” – and loses meaning in the process.


The real problem: an industry on the threshold of trust

So far, this sounds like a technical problem with technical solutions. But it's more than that.


UX research thrives on the credibility of qualitative data. When stakeholders are presented with quotes that no one actually said – or when topics are prioritized because an LLM has smoothed out the nuances – it undermines more than just individual projects. It undermines trust in qualitative research as a whole.


Especially now, when organizations are under cost pressure and there is a great temptation to see LLMs as a cheaper substitute for thorough analysis, there needs to be a clear awareness of these limitations. And that awareness is still alarmingly lacking.


A research group conducting thematic analyses of software engineering interviews (arXiv) needed eight prompt iterations before LLM outputs were methodologically acceptable – and even then, manual comparison with line and segment numbers was absolutely necessary. This is not a workflow that can be used casually.


The risk is real: if a discipline that derives its weight from its proximity to real user voices replaces those voices with AI paraphrases – without realizing or communicating it – it loses the basis of its legitimacy.


Practical tips: How to use LLMs responsibly in transcript analysis

That doesn't mean stay away from LLMs. It means use them wisely. Here are the strategies I recommend and use myself:


1. Star model instead of one long chat

Analyze each interview in a separate, fresh chat – without knowledge of the other interviews. You do the synthesis yourself. You are the authority who recognizes the patterns across the interviews, because you know the interviews – the model only ever sees its own previous output.
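
In an API-based workflow, the star model can be enforced technically: every interview gets its own, completely independent request, and the cross-interview synthesis never touches the model at all. A minimal sketch, assuming the OpenAI Python SDK and a folder of transcript files (folder layout, prompt wording, and model name are all illustrative):

```python
from pathlib import Path

from openai import OpenAI  # any chat-completion client works the same way

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Identify the main topics in this interview transcript. "
    "Quote supporting passages verbatim and do not draw on any other interview."
)

per_interview_results = {}
for transcript_file in sorted(Path("transcripts").glob("interview_*.txt")):
    # Each interview gets its own, completely fresh message list: no shared chat history,
    # so findings from one interview can never leak into the analysis of another.
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model name
        messages=[{"role": "user", "content": f"{PROMPT}\n\n{transcript_file.read_text()}"}],
    )
    per_interview_results[transcript_file.name] = response.choices[0].message.content

# The cross-interview synthesis is deliberately not delegated to the model:
# that step stays with the researcher (the centre of the "star").
```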


2. Re-grounding in later turns

In later analysis steps, don't trust the chat history; instead, provide the original transcript again. Instead of “Based on our analysis so far ...”, say: “Here is the complete transcript again. Check whether the following topics are actually supported by the text.”
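
In code, re-grounding simply means that a later verification turn is sent as a fresh request containing the full transcript, rather than as a continuation of the existing chat. A sketch along the lines of the previous example (client, model name, and prompt wording are again illustrative):

```python
def verify_against_transcript(client, transcript: str, claimed_topics: str) -> str:
    """Re-check earlier findings against the original transcript instead of the chat history."""
    prompt = (
        "Here is the complete transcript again.\n\n"
        f"{transcript}\n\n"
        "Check whether each of the following topics is actually supported by the text. "
        "For every topic, quote the supporting passage verbatim or state clearly that you found none.\n\n"
        f"{claimed_topics}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative
        messages=[{"role": "user", "content": prompt}],  # a fresh context: no earlier turns are passed in
    )
    return response.choices[0].message.content
```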


3. Number transcripts in advance

Number transcripts with line numbers and ask the LLM to specify the source line for each code and each quote. This creates traceability – and makes hallucinations visible before they make their way into the report.
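
The numbering itself is trivial to automate before the transcript ever reaches the model. A small helper in plain Python (the “L017” format is just one possible convention):

```python
def number_transcript(raw: str) -> str:
    """Prefix every transcript line with a stable marker such as 'L017: ...'."""
    return "\n".join(
        f"L{i:03d}: {line}" for i, line in enumerate(raw.splitlines(), start=1)
    )

# In the prompt, require the model to cite these markers for every code and every quote,
# e.g. "Frustration with export (L017-L019): '...'".
```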


4. Counterfactual prompts

After each analysis phase, explicitly ask the opposite question: “What in the data contradicts these topics?” or “Which statements do not fit into any of these categories?” This forces the model to critically examine its own previous output – and can at least slow down its tendency to keep spinning a consistent narrative.


5. Require uncertainty marking

Instruct the model to distinguish between statements directly supported by quotes and derived interpretations. A prompt such as “Clearly mark what is directly in the transcript and what you are interpreting” helps to keep the boundary visible.


6. Incorporate an omission check

Ask explicitly: “Which passages of the transcript do not appear in any of the topics mentioned?” This is exactly where the surprising insights could lie – the passages that the model filtered out because they did not fit the pattern.
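
If you numbered the transcript as in tip 3, part of this check can even be mechanized: collect every line number the model cited and look at which passages never appear in any topic. A rough sketch, assuming the “L017” / “L017-L019” citation convention from the numbering helper above:

```python
import re


def uncited_lines(model_output: str, total_lines: int) -> list[int]:
    """Return the transcript line numbers that the model never cited in its analysis."""
    cited = set()
    # Match single citations like 'L017' as well as ranges like 'L017-L019'.
    for start, end in re.findall(r"L(\d+)(?:\s*[-–]\s*L(\d+))?", model_output):
        low = int(start)
        high = int(end) if end else low
        cited.update(range(low, high + 1))
    return [n for n in range(1, total_lines + 1) if n not in cited]

# Anything this returns is material the model silently left out, and worth reading yourself.
```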


7. And the most important tip: read the transcripts yourself. Always.

Several studies show that researchers who conducted their own interviews and read the transcripts themselves are much better at recognizing LLM errors. This familiarity with the data is not overhead that should be optimized away – it is the analytical process itself. No prompt protects as well.


FAQ

Are LLMs fundamentally unsuitable for qualitative UX analysis?

No – but their use must be clearly scoped and methodologically deliberate. LLMs can be useful for structuring, clustering, and providing a quick overview. When it comes to interpretation, quote selection, or identifying contradictions, a human verification layer is always needed.


Why is the measured hallucination rate (1–2%) so low in some studies when HalluHard shows 30%?

Because the test conditions are fundamentally different. Studies with low rates often work with clearly defined tasks, predefined codebooks, and single-turn settings.


HalluHard simulates more realistic scenarios: open-ended answers, citation requirements, multiple turns, niche domains. The gap between benchmark performance and real-world application is the real problem.


What about specialized tools for qualitative analysis—are they better?

In some cases, yes. Tools such as DeTAILS (Deep Thematic Analysis with Iterative LLM Support) incorporate systematic comparisons – every quote generated by the LLM is automatically checked against the original text and rejected if it does not appear verbatim.


This is an important step. But even these tools cannot replace the analytical judgment of researchers.


Does this also apply if I only use LLMs to summarize very long transcripts?

Yes – the omission rate is also a problem there. Studies show that omissions are often twice as common as fabrications. Especially when summarizing, there is a risk that subtle signals and contradictions will be filtered out because the model is optimized for coherence and completeness.


How do I explain to my stakeholders why I still check LLM outputs manually?

Simply and honestly: Qualitative data is the basis for design decisions. When we present quotes, they must be accurate. No company would make decisions based on production data without verifying it – and user quotes are our production data.


Conclusion: A tool, not a substitute

LLMs can accelerate qualitative analysis. They can structure, cluster, and provide an initial overview. But they cannot replace what constitutes good qualitative research: proximity to the data, a sense for the unexpected, and a willingness to repeatedly check interpretations against the original.


The findings from HalluHard and the transcript studies show that the problem is not marginal. It is structural. And it becomes more visible the more complex and realistic the application scenarios become.


Those who ignore this risk more than just poor research. The industry risks damaging the foundation of its credibility—at the very moment when it wants to grow the fastest.

My appeal: Use LLMs as an aid. Use them wisely. But keep control over what matters: the real voice of your users.


About the author

Tara Bosenick is a UX consultant and co-owner of Uintent. Since 1999, she has been helping companies make their products more user-friendly – with sound research methods and a clear eye for the essentials. As a speaker at conferences such as Mensch & Computer and the World Usability Congress, she shares her knowledge of UX and AI. Her workshops on UX-AI prompting and AI integration cover what makes good UX: clear benefits, direct applicability, and enjoyment of the process.


💌 Want more? Then read on—in our newsletter.

Published four times a year. Sticks in your mind longer. https://www.uintent.com/de/newsletter


