
AI WRITING, DIGITISATION, HOW-TO, PROMPTS

How do we know that our prompt is doing a good job? Why UX research needs an evaluation methodology for AI-based analysis

8 min read · Feb 26, 2026

Imagine this: You've coded 40 user interviews with a carefully crafted prompt. The results look plausible. The team is satisfied, and the project moves forward. But then you quietly ask yourself: Would the same results have come out yesterday? If a colleague uses the same prompt, will they reach the same conclusions? What will happen after the next model update?


These questions are not hypothetical. They affect every UX team that uses AI-powered analysis—which is quite a few of us by now.


In my work as a UX consultant, I have seen a rapid shift since 2023: prompts are becoming an analysis tool. For sentiment analysis, for coding qualitative data, for synthesizing user feedback. The results are often impressive. But in doing so, we are skipping a step that we would naturally require for any other research tool: checking the quality criteria.


This article highlights the problem. It does not provide a ready-made solution – but it does ask the questions that we as a UX research community urgently need to answer.


📌 The most important points in brief

  • Prompts are measuring instruments – and need the same methodological testing as questionnaires or test protocols.

  • Reliability (stability, robustness, model independence) can be tested partially automatically – but hardly anyone does so.

  • Validity (does the prompt measure the right thing?) is the more difficult and important question—and currently completely unresolved.

  • Existing evaluation frameworks from engineering solve sub-problems but do not systematically address validity.

  • UX research has the methodological knowledge to address this problem—but does not yet apply it to prompts.

  • Without quality criteria for prompts, we risk basing decisions on untested analyses.


Why should we care about quality criteria for prompts?

Because as UX researchers, we make decisions based on data – and the quality of that data depends directly on the quality of our instruments.


With a questionnaire, we ask: Is it reliable? Is it valid? Has it been piloted? With a usability test, we check whether the tasks measure what we want to measure. But with a prompt that categorizes 200 customer reviews according to pain points? We look at the result and think, “Looks good.”


That's not a criticism. The tools make it easy for us to skip this step. But “looks good” is not a quality criterion. It's a gut feeling. And we don't usually base research results on gut feelings.


What do reliability and validity mean when the instrument is a prompt?

The concepts are familiar – but applying them to UX AI prompting is not yet common practice.


Think of the prompt as a measuring instrument and the LLM output as a measurement result. Then the same quality criteria apply as for any empirical tool: Reliability asks whether the instrument measures reliably. Validity asks whether it measures the right thing. Together, these two factors determine how much confidence we can place in the results.


The difference from classic instruments: With a questionnaire, the instrument remains stable as long as no one changes the questions. With a prompt, the instrument can change without you doing anything – namely, whenever the model provider releases an update. Your prompt is the same, but the system behind it is not.


How stable is our prompt? Four facets of reliability

Reliability can be viewed in four facets, each covering a different aspect of dependability.


Repeatability: Does the same result come out twice?

Send the same input with the same prompt to the same model three times in a row. How similar are the outputs? With deterministic settings (temperature = 0), we expect high consistency. But many teams use higher temperature values for more creative outputs – and then the question becomes relevant: Do only the formulations vary, or do the content statements vary as well?


Hypothetical scenario: You are analyzing user feedback on a banking app. The prompt is to identify the three most important pain points. On the first run, they are “loading times, navigation, security concerns.” On the second run, they are “performance, confusing menu, lack of trust.” Similar in content – but not identical. Which result do you take? And what do you report to the stakeholder?
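One way to make this concrete is to run the prompt several times, normalize the outputs to label sets, and compute their pairwise overlap. Here is a minimal sketch in Python – the data is hypothetical, and it assumes you have already mapped synonymous labels ("performance" vs. "loading times") to a shared vocabulary, which is itself a judgment call:

```python
from itertools import combinations

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity between two label sets (1.0 = identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def repeatability(runs: list[set[str]]) -> float:
    """Mean pairwise Jaccard similarity across repeated runs of the same prompt."""
    pairs = list(combinations(runs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Three hypothetical runs of the banking-app prompt, labels already normalized
runs = [
    {"loading times", "navigation", "security concerns"},
    {"loading times", "navigation", "lack of trust"},
    {"loading times", "confusing menu", "security concerns"},
]
print(round(repeatability(runs), 2))  # prints 0.4 – far from stable
```

A score near 1.0 suggests stable content; a score like the 0.4 here means the prompt's "top pain points" depend heavily on the run you happened to keep.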


Robustness: Does the prompt survive a differently worded question?

Real users never phrase things the same way as your test data set. If your prompt was written and tested against "What bothers you about the app?", but someone feeds it "What problems do users have with the application?", does it deliver the same content?


This is parallel form reliability, applied to UX AI research: How robust is the prompt to natural language variation in the inputs?
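A sketch of what such a check could look like. The `classify_feedback` function below is a toy keyword stub standing in for the real LLM call; in practice you would swap in your prompt and model API, and the paraphrase pairs would come from your own data:

```python
# Toy stand-in for the real LLM call (in practice: your prompt + model API)
def classify_feedback(text: str) -> str:
    text = text.lower()
    if "slow" in text or "loading" in text:
        return "performance"
    if "menu" in text or "find" in text:
        return "navigation"
    return "other"

# Paraphrase pairs that should land in the same category (parallel-form reliability)
paraphrase_pairs = [
    ("The app is really slow to start.", "Loading the app takes forever."),
    ("I can't find anything in the menu.", "The menu makes it hard to find things."),
]

agreement = sum(
    classify_feedback(a) == classify_feedback(b) for a, b in paraphrase_pairs
) / len(paraphrase_pairs)
print(f"Paraphrase agreement: {agreement:.0%}")
```

The stub agrees with itself by construction; with a real model, the agreement rate tells you how sensitive your prompt is to surface wording.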


Model independence: Does the prompt only work with GPT?

If your prompt delivers good results with GPT-4o but noticeably different results with Claude or Gemini, then it is not measuring the construct, but a model property. This is relevant because model changes occur regularly in practice: for cost reasons, due to provider changes, or because the team has different preferences.


In my consulting practice, I see teams that carefully optimize prompts for a specific model – and then are surprised to find that they are not transferable after a model change. This is not the team's fault. It is a symptom of a lack of reliability testing.
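Before switching models, transferability can be quantified: code the same items with both models and measure how often the labels match. A minimal sketch with hypothetical per-item codes:

```python
def cross_model_agreement(labels_a: list[str], labels_b: list[str]) -> float:
    """Share of items on which two models assign the same label."""
    assert len(labels_a) == len(labels_b), "both models must code the same items"
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Hypothetical per-item codes from two models on the same six feedback items
gpt_labels    = ["performance", "navigation", "trust", "performance", "navigation", "other"]
claude_labels = ["performance", "navigation", "trust", "navigation",  "navigation", "other"]
print(cross_model_agreement(gpt_labels, claude_labels))  # 5 of 6 items agree
```

Raw agreement is a crude first look; for a chance-corrected version, the same data can feed Cohen's kappa, discussed under criterion validity below.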


Internal consistency: Do the partial results contradict each other?

Many UX AI prompts address several aspects at once: “Analyze this user feedback by sentiment, topic, and urgency.” If a piece of feedback is classified as “very negative” in sentiment but “low urgency,” is that a valid edge case or a contradiction?


Internal consistency checks whether the partial results of a multi-aspect prompt fit together. In classical test theory, this corresponds to Cronbach's alpha – a measure of whether the items in a test measure the same construct.
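One pragmatic starting point is to define, as a team, which combinations of partial results count as suspicious and flag them for human review rather than resolving them automatically. A sketch, with a hypothetical rule set and hypothetical coded items:

```python
# Each item: (sentiment, urgency) as produced by a hypothetical multi-aspect prompt
coded = [
    ("very negative", "high"),
    ("neutral", "low"),
    ("very negative", "low"),   # candidate contradiction
    ("positive", "low"),
]

# The rule set is an assumption: your team decides which combinations need review
suspicious = {("very negative", "low"), ("positive", "high")}

flags = [i for i, item in enumerate(coded) if item in suspicious]
print(flags)  # indices of items to hand back to a human reviewer → [2]
```

The point is not to declare such combinations wrong – some are valid edge cases – but to make sure a human looks at them instead of letting them slip into the report.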


Is our prompt measuring the right thing? The question of validity

Reliability is the prerequisite, validity is the goal. A prompt can be highly reliable—i.e., stable, robust, model-independent—and still measure the wrong thing. And this is where it gets really uncomfortable.


Content validity: Does the output cover what it is supposed to cover?

The most fundamental question: Does the output contain all relevant aspects? And does it contain anything that doesn't belong there?


Hypothetical scenario: Your prompt codes interviews on the topic of “onboarding experience.” It reliably identifies topics such as “tutorials,” “help section,” and “first steps.” But it systematically overlooks emotional aspects such as frustration, overwhelm, or feelings of accomplishment—because the category does not query this dimension. The output is not wrong. It is incomplete. And this incompleteness is invisible as long as no one specifically looks for it.


Content validity requires expert assessment: people who know the subject area systematically check whether the instrument covers what it is supposed to cover. This is standard practice for questionnaires. With prompts, almost no one does it.


Criterion validity: Does the output match an external standard?

Here we need an external criterion – something against which we can validate the prompt output. And that is often the most difficult point: What is the gold standard?


Possible criteria:


  • Expert judgment: Experienced UX researchers manually code the same data. Then we compare: How high is the agreement between humans and prompts? This can be expressed as a correlation coefficient or Cohen's Kappa (a measure of agreement between two evaluators, adjusted for chance hits).

  • Observed behavior: If the prompt identifies usage problems, do we find the same problems in usability tests? That would be predictive validity.

  • Business metrics: If the prompt prioritizes suggestions for improvement, do the relevant KPIs improve when we follow the recommendations?


Each of these criteria requires effort. But without an external criterion, any evaluation remains circular reasoning: we evaluate the output of a prompt with an LLM judge whose judgment we have also not validated.
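Cohen's kappa itself is straightforward to compute once human codes and prompt codes exist for the same items. A minimal implementation (it omits edge cases such as perfect expected agreement, and the codes below are hypothetical):

```python
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa: agreement between two coders, corrected for chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both coders pick the same label at random
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(rater_a) | set(rater_b)
    )
    return (observed - expected) / (1 - expected)

human  = ["pain", "praise", "pain", "pain", "praise", "pain"]
prompt = ["pain", "praise", "pain", "praise", "praise", "pain"]
print(round(cohens_kappa(human, prompt), 2))  # prints 0.67
```

By common rules of thumb, values above roughly 0.6 indicate substantial agreement – but what counts as "good enough" for your research question is, again, a methodological decision, not a constant.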


Construct validity: Are we really measuring what we think we are measuring?

The most challenging level. If your LLM judge says “this output is helpful,” what does “helpful” actually mean? Does the evaluation rubric actually measure helpfulness? Or does it measure comprehensiveness, which correlates with helpfulness but is not the same thing?


From research on LLM-as-a-Judge (the approach of using an LLM to evaluate the outputs of another LLM), we know that LLM judges systematically prefer longer answers [Dubois et al., 2024]. This is a classic case of lack of construct validity—the instrument does not measure the construct “quality,” but rather the proxy feature “length.”


This is highly relevant for UX AI research: when we evaluate prompts, we need to ensure that our evaluation criteria actually reflect the quality dimensions that are relevant to our research question – and not something else that happens to correlate with it.
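A simple self-check for the length bias described above: correlate your judge's scores with output length. A high correlation does not prove the judge is invalid – longer answers can genuinely be better – but it is a warning sign worth investigating. A sketch with hypothetical scores and word counts:

```python
def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical word counts and judge scores (1-5) for ten prompt outputs
lengths = [80, 120, 200, 150, 300, 90, 250, 110, 180, 220]
scores  = [3,  4,   5,   4,   5,   3,  5,   3,   4,   5]

r = pearson(lengths, scores)
print(round(r, 2))  # a high r suggests the judge may be rewarding length, not quality
```

If the correlation is strikingly high, one follow-up is to re-score a sample of outputs that have been manually trimmed or padded and see whether the judgments move.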


What does the market currently offer – and where does it fall short?

A whole range of evaluation frameworks is now available. None of them were developed for UX research, but some are useful in parts. Here is an honest overview (as of February 2026):


  • LLM-as-a-Judge (e.g., Pydantic Evals, DeepEval)
    What it can do: Scales subjective quality dimensions based on defined rubrics.
    What is missing: Who validates the judge? The rubric itself is a prompt – the same problem one level higher.

  • RAG evaluation metrics (e.g., RAGAS)
    What it can do: Faithfulness, context recall – good for retrieval systems.
    What is missing: Not transferable to the open-ended analysis tasks that are typical in UX research.

  • CI/CD pipelines (e.g., Promptfoo, Braintrust)
    What it can do: Detect changes and automatically warn of regressions.
    What is missing: They say "something has changed" – not whether it has gotten better or worse.

  • Pairwise comparison (e.g., Chatbot Arena)
    What it can do: Collects relative preferences, well validated against human judgment.
    What is missing: Not applicable to domain-specific analysis tasks; requires crowd evaluation.

All of these approaches solve partial problems. The LLM-as-a-Judge approach comes closest to what we need – but it assumes what actually needs to be proven: that the evaluation criteria are valid. And none of these approaches systematically addresses the question of validity.


Questions we have not yet answered

I am deliberately ending this article without a solution. Not out of convenience, but because I believe we need to formulate the questions clearly before we start building answers. Here are the ones that concern me the most:


Who defines what “good output” means – and according to what procedure?

Today, it is often the person who writes the prompt who decides. But operationalizing quality is a methodological task, not a byproduct of prompt development. Which process is appropriate?


How do we calibrate automated evaluations against human judgment?

Human evaluation is time-consuming. LLM judges are not validated. How do we find a pragmatic middle ground—one that is methodologically sound without requiring full validation in every project?


What is our gold standard if there isn't one?

For many UX research tasks, there is no objectively “correct” answer. If three experienced researchers code an interview differently, what is the reference against which we validate the prompt?


How do we deal with an instrument that changes without our intervention?

Model updates come without warning. The prompt remains the same, but the system behind it does not. It's like someone changing the scale of your questionnaire overnight. What monitoring strategy is appropriate?
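This remains an open question, but one minimal starting point – a drift check, not a full answer – is to persist the labels from a validated baseline run and compare each new run against them. A sketch with hypothetical items; the alert threshold is a team decision, not a universal constant:

```python
# Baseline labels saved after the last validated run (assumption: you persist these)
baseline = {"item1": "performance", "item2": "navigation", "item3": "trust"}

# Labels from today's run with the unchanged prompt
current = {"item1": "performance", "item2": "other", "item3": "trust"}

# Share of items whose label changed although the prompt did not
drift = sum(baseline[k] != current[k] for k in baseline) / len(baseline)

if drift > 0.1:  # threshold chosen for illustration only
    print(f"Warning: {drift:.0%} of items changed label since the baseline run")
```

A drift alert does not tell you whether the new labels are better or worse – that judgment still needs a human – but it ensures that a silent model update does not go unnoticed.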


And fundamentally: Is it justifiable to use prompt-based analysis in research projects as long as we don't have answers to these questions?

I think so, under certain conditions. But we have to stop pretending that the quality issue has already been resolved. It hasn't. And as the UX research community, we should be the ones to say so openly.


I look forward to the discussion.


Frequently asked questions

What does LLM-as-a-Judge mean?

LLM-as-a-Judge is a process in which an LLM (Large Language Model) evaluates the outputs of another LLM based on defined criteria. It is currently the most common approach to automated evaluation of prompt results – but its own validity is often not tested.


Can't I just check the quality of my prompts on a random basis?

Random sampling is better than no testing at all. But it only tells you whether individual outputs look plausible – not whether the prompt works reliably over time, across different inputs and different models. For systematic quality assurance, you need a structured evaluation process.


As a UX researcher, do I need technical knowledge to evaluate prompts?

The methodological basics—reliability, validity, inter-rater reliability—are already part of the UX research toolkit. The technical implementation (eval frameworks, automated pipelines) requires additional skills or collaboration with developers. But the most important skill—evaluation design—is yours.


Does a model update really change the results of my prompt?

Yes, this is documented and common in practice. Model updates can cause subtle behavioral changes that affect the outputs of your prompt—even if you haven't changed the prompt itself. Without systematic monitoring, such changes often go unnoticed.





About the author: Tara Maria Bosenick has been working as a UX consultant since 1999, supporting companies at the interface of user research and technology. She has extensive experience with qualitative and quantitative research methods and is currently focusing intensively on the question of how AI-supported analysis can be methodologically validated in UX research.


💌 Want more? Then read on—in our newsletter.

Published four times a year. Sticks in your mind longer. https://www.uintent.com/de/newsletter



