
AI WRITING, DIGITISATION, HOW-TO, PROMPTS

How do we know that our prompt is doing a good job? Why UX research needs an evaluation methodology for AI-based analysis

8 min read · Feb 26, 2026

Imagine this: You've coded 40 user interviews with a carefully crafted prompt. The results look plausible. The team is satisfied, and the project moves forward. But then you quietly ask yourself: Would the same results have come out yesterday? If a colleague uses the same prompt, will they reach the same conclusions? What will happen after the next model update?


These questions are not hypothetical. They affect every UX team that uses AI-powered analysis—which is quite a few of us by now.


In my work as a UX consultant, I have seen a rapid shift since 2023: prompts are becoming an analysis tool. For sentiment analysis, for coding qualitative data, for synthesizing user feedback. The results are often impressive. But in doing so, we are skipping a step that we would naturally require for any other research tool: checking the quality criteria.


This article highlights the problem. It does not provide a ready-made solution – but it does ask the questions that we as a UX research community urgently need to answer.


📌 The most important points in brief

  • Prompts are measuring instruments – and need the same methodological testing as questionnaires or test protocols.

  • Reliability (stability, robustness, model independence) can be tested partially automatically – but hardly anyone does so.

  • Validity (does the prompt measure the right thing?) is the more difficult and important question—and currently completely unresolved.

  • Existing evaluation frameworks from engineering solve sub-problems but do not systematically address validity.

  • UX research has the methodological knowledge to address this problem—but does not yet apply it to prompts.

  • Without quality criteria for prompts, we risk basing decisions on untested analyses.


Why should we care about quality criteria for prompts?

Because as UX researchers, we make decisions based on data – and the quality of that data depends directly on the quality of our instruments.


With a questionnaire, we ask: Is it reliable? Is it valid? Has it been piloted? With a usability test, we check whether the tasks measure what we want to measure. But with a prompt that categorizes 200 customer reviews according to pain points? We look at the result and think, “Looks good.”


That's not a criticism. The tools make it easy for us to skip this step. But “looks good” is not a quality criterion. It's a gut feeling. And we don't usually base research results on gut feelings.


What do reliability and validity mean when the instrument is a prompt?

The concepts are familiar – but applying them to UX AI prompting is not yet common practice.


Think of the prompt as a measuring instrument and the LLM output as a measurement result. Then the same quality criteria apply as for any empirical tool: Reliability asks whether the instrument measures reliably. Validity asks whether it measures the right thing. Together, these two factors determine how much confidence we can place in the results.


The difference from classic instruments: With a questionnaire, the instrument remains stable as long as no one changes the questions. With a prompt, the instrument can change without you doing anything – namely, whenever the model provider releases an update. Your prompt is the same, but the system behind it is not.


How stable is our prompt? Four facets of reliability

Reliability can be viewed in four facets, each covering a different aspect of dependability.


Repeatability: Does the same result come out twice?

Send the same input with the same prompt to the same model three times in a row. How similar are the outputs? With deterministic settings (temperature = 0), we expect high consistency. But many teams use higher temperature values for more creative outputs – and then the question becomes relevant: Do only the formulations vary, or do the content statements vary as well?


Hypothetical scenario: You are analyzing user feedback on a banking app. The prompt is to identify the three most important pain points. On the first run, they are “loading times, navigation, security concerns.” On the second run, they are “performance, confusing menu, lack of trust.” Similar in content – but not identical. Which result do you take? And what do you report to the stakeholder?
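One way to make this concrete is to run the prompt several times, normalize the outputs to label sets, and compute their pairwise overlap. Here is a minimal sketch in Python – the data is hypothetical, and it assumes you have already mapped synonymous labels ("performance" vs. "loading times") to a shared vocabulary, which is itself a judgment call:

```python
from itertools import combinations

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity between two label sets (1.0 = identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def repeatability(runs: list[set[str]]) -> float:
    """Mean pairwise Jaccard similarity across repeated runs of the same prompt."""
    pairs = list(combinations(runs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Three hypothetical runs of the banking-app prompt, labels already normalized
runs = [
    {"loading times", "navigation", "security concerns"},
    {"loading times", "navigation", "lack of trust"},
    {"loading times", "confusing menu", "security concerns"},
]
print(round(repeatability(runs), 2))  # prints 0.4 – far from stable
```

A score near 1.0 suggests stable content; a score like the 0.4 here means the prompt's "top pain points" depend heavily on the run you happened to keep.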


Robustness: Does the prompt survive a differently worded question?

Real users never phrase things the same way as your test data set. If your prompt was written and tested against "What bothers you about the app?", but someone feeds it "What problems do users have with the application?", does it deliver the same content?


This is parallel form reliability, applied to UX AI research: How robust is the prompt to natural language variation in the inputs?
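A sketch of what such a check could look like. The `classify_feedback` function below is a toy keyword stub standing in for the real LLM call; in practice you would swap in your prompt and model API, and the paraphrase pairs would come from your own data:

```python
# Toy stand-in for the real LLM call (in practice: your prompt + model API)
def classify_feedback(text: str) -> str:
    text = text.lower()
    if "slow" in text or "loading" in text:
        return "performance"
    if "menu" in text or "find" in text:
        return "navigation"
    return "other"

# Paraphrase pairs that should land in the same category (parallel-form reliability)
paraphrase_pairs = [
    ("The app is really slow to start.", "Loading the app takes forever."),
    ("I can't find anything in the menu.", "The menu makes it hard to find things."),
]

agreement = sum(
    classify_feedback(a) == classify_feedback(b) for a, b in paraphrase_pairs
) / len(paraphrase_pairs)
print(f"Paraphrase agreement: {agreement:.0%}")
```

The stub agrees with itself by construction; with a real model, the agreement rate tells you how sensitive your prompt is to surface wording.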


Model independence: Does the prompt only work with GPT?

If your prompt delivers good results with GPT-4o but noticeably different results with Claude or Gemini, then it is not measuring the construct, but a model property. This is relevant because model changes occur regularly in practice: for cost reasons, due to provider changes, or because the team has different preferences.


In my consulting practice, I see teams that carefully optimize prompts for a specific model – and then are surprised to find that they are not transferable after a model change. This is not the team's fault. It is a symptom of a lack of reliability testing.
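Before switching models, transferability can be quantified: code the same items with both models and measure how often the labels match. A minimal sketch with hypothetical per-item codes:

```python
def cross_model_agreement(labels_a: list[str], labels_b: list[str]) -> float:
    """Share of items on which two models assign the same label."""
    assert len(labels_a) == len(labels_b), "both models must code the same items"
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Hypothetical per-item codes from two models on the same six feedback items
gpt_labels    = ["performance", "navigation", "trust", "performance", "navigation", "other"]
claude_labels = ["performance", "navigation", "trust", "navigation",  "navigation", "other"]
print(cross_model_agreement(gpt_labels, claude_labels))  # 5 of 6 items agree
```

Raw agreement is a crude first look; for a chance-corrected version, the same data can feed Cohen's kappa, discussed under criterion validity below.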


Internal consistency: Do the partial results contradict each other?

Many UX AI prompts address several aspects at once: “Analyze this user feedback by sentiment, topic, and urgency.” If a piece of feedback is classified as “very negative” in sentiment but “low urgency,” is that a valid edge case or a contradiction?


Internal consistency checks whether the partial results of a multi-aspect prompt fit together. In classical test theory, this corresponds to Cronbach's alpha – a measure of whether the items in a test measure the same construct.
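One pragmatic starting point is to define, as a team, which combinations of partial results count as suspicious and flag them for human review rather than resolving them automatically. A sketch, with a hypothetical rule set and hypothetical coded items:

```python
# Each item: (sentiment, urgency) as produced by a hypothetical multi-aspect prompt
coded = [
    ("very negative", "high"),
    ("neutral", "low"),
    ("very negative", "low"),   # candidate contradiction
    ("positive", "low"),
]

# The rule set is an assumption: your team decides which combinations need review
suspicious = {("very negative", "low"), ("positive", "high")}

flags = [i for i, item in enumerate(coded) if item in suspicious]
print(flags)  # indices of items to hand back to a human reviewer → [2]
```

The point is not to declare such combinations wrong – some are valid edge cases – but to make sure a human looks at them instead of letting them slip into the report.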


Is our prompt measuring the right thing? The question of validity

Reliability is the prerequisite, validity is the goal. A prompt can be highly reliable—i.e., stable, robust, model-independent—and still measure the wrong thing. And this is where it gets really uncomfortable.


Content validity: Does the output cover what it is supposed to cover?

The most fundamental question: Does the output contain all relevant aspects? And does it contain anything that doesn't belong there?


Hypothetical scenario: Your prompt codes interviews on the topic of “onboarding experience.” It reliably identifies topics such as “tutorials,” “help section,” and “first steps.” But it systematically overlooks emotional aspects such as frustration, overwhelm, or feelings of accomplishment—because the category does not query this dimension. The output is not wrong. It is incomplete. And this incompleteness is invisible as long as no one specifically looks for it.


Content validity requires expert assessment: people who know the subject area systematically check whether the instrument covers what it is supposed to cover. This is standard practice for questionnaires. With prompts, almost no one does it.


Criterion validity: Does the output match an external standard?

Here we need an external criterion – something against which we can validate the prompt output. And that is often the most difficult point: What is the gold standard?


Possible criteria:


  • Expert judgment: Experienced UX researchers manually code the same data. Then we compare: How high is the agreement between humans and prompts? This can be expressed as a correlation coefficient or Cohen's Kappa (a measure of agreement between two evaluators, adjusted for chance hits).

  • Observed behavior: If the prompt identifies usage problems, do we find the same problems in usability tests? That would be predictive validity.

  • Business metrics: If the prompt prioritizes suggestions for improvement, do the relevant KPIs improve when we follow the recommendations?


Each of these criteria requires effort. But without an external criterion, any evaluation remains circular reasoning: we evaluate the output of a prompt with an LLM judge whose judgment we have also not validated.
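Cohen's kappa itself is straightforward to compute once human codes and prompt codes exist for the same items. A minimal implementation (it omits edge cases such as perfect expected agreement, and the codes below are hypothetical):

```python
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa: agreement between two coders, corrected for chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both coders pick the same label at random
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(rater_a) | set(rater_b)
    )
    return (observed - expected) / (1 - expected)

human  = ["pain", "praise", "pain", "pain", "praise", "pain"]
prompt = ["pain", "praise", "pain", "praise", "praise", "pain"]
print(round(cohens_kappa(human, prompt), 2))  # prints 0.67
```

By common rules of thumb, values above roughly 0.6 indicate substantial agreement – but what counts as "good enough" for your research question is, again, a methodological decision, not a constant.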


Construct validity: Are we really measuring what we think we are measuring?

The most challenging level. If your LLM judge says “this output is helpful,” what does “helpful” actually mean? Does the evaluation rubric actually measure helpfulness? Or does it measure comprehensiveness, which correlates with helpfulness but is not the same thing?


From research on LLM-as-a-Judge (the approach of using an LLM to evaluate the outputs of another LLM), we know that LLM judges systematically prefer longer answers [Dubois et al., 2024]. This is a classic case of lack of construct validity—the instrument does not measure the construct “quality,” but rather the proxy feature “length.”


This is highly relevant for UX AI research: when we evaluate prompts, we need to ensure that our evaluation criteria actually reflect the quality dimensions that are relevant to our research question – and not something else that happens to correlate with it.
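A simple self-check for the length bias described above: correlate your judge's scores with output length. A high correlation does not prove the judge is invalid – longer answers can genuinely be better – but it is a warning sign worth investigating. A sketch with hypothetical scores and word counts:

```python
def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical word counts and judge scores (1-5) for ten prompt outputs
lengths = [80, 120, 200, 150, 300, 90, 250, 110, 180, 220]
scores  = [3,  4,   5,   4,   5,   3,  5,   3,   4,   5]

r = pearson(lengths, scores)
print(round(r, 2))  # a high r suggests the judge may be rewarding length, not quality
```

If the correlation is strikingly high, one follow-up is to re-score a sample of outputs that have been manually trimmed or padded and see whether the judgments move.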


What does the market currently offer – and where does it fall short?

A whole range of evaluation frameworks is now available. None of them were developed for UX research, but some are useful in parts. Here is an honest overview (as of February 2026):


  • LLM-as-a-Judge (e.g., Pydantic Evals, DeepEval)
    What it can do: Scales subjective quality dimensions based on defined rubrics.
    What is missing: Who validates the judge? The rubric itself is a prompt – the same problem one level higher.

  • RAG evaluation metrics (e.g., RAGAS)
    What it can do: Faithfulness, context recall – good for retrieval systems.
    What is missing: Not transferable to the open-ended analysis tasks that are typical in UX research.

  • CI/CD pipelines (e.g., Promptfoo, Braintrust)
    What it can do: Detect changes and automatically warn of regressions.
    What is missing: They say "something has changed" – not whether it has gotten better or worse.

  • Pairwise comparison (e.g., Chatbot Arena)
    What it can do: Collects relative preferences, well validated against human judgment.
    What is missing: Not applicable to domain-specific analysis tasks; requires crowd evaluation.

All of these approaches solve partial problems. The LLM-as-a-Judge approach comes closest to what we need – but it assumes what actually needs to be proven: that the evaluation criteria are valid. And none of these approaches systematically addresses the question of validity.


Questions we have not yet answered

I am deliberately ending this article without a solution. Not out of convenience, but because I believe we need to formulate the questions clearly before we start building answers. Here are the ones that concern me the most:


Who defines what “good output” means – and according to what procedure?

Today, it is often the person who writes the prompt who decides. But operationalizing quality is a methodological task, not a byproduct of prompt development. Which process is appropriate?


How do we calibrate automated evaluations against human judgment?

Human evaluation is time-consuming. LLM judges are not validated. How do we find a pragmatic middle ground—one that is methodologically sound without requiring full validation in every project?


What is our gold standard if there isn't one?

For many UX research tasks, there is no objectively “correct” answer. If three experienced researchers code an interview differently, what is the reference against which we validate the prompt?


How do we deal with an instrument that changes without our intervention?

Model updates come without warning. The prompt remains the same, but the system behind it does not. It's like someone changing the scale of your questionnaire overnight. What monitoring strategy is appropriate?
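This remains an open question, but one minimal starting point – a drift check, not a full answer – is to persist the labels from a validated baseline run and compare each new run against them. A sketch with hypothetical items; the alert threshold is a team decision, not a universal constant:

```python
# Baseline labels saved after the last validated run (assumption: you persist these)
baseline = {"item1": "performance", "item2": "navigation", "item3": "trust"}

# Labels from today's run with the unchanged prompt
current = {"item1": "performance", "item2": "other", "item3": "trust"}

# Share of items whose label changed although the prompt did not
drift = sum(baseline[k] != current[k] for k in baseline) / len(baseline)

if drift > 0.1:  # threshold chosen for illustration only
    print(f"Warning: {drift:.0%} of items changed label since the baseline run")
```

A drift alert does not tell you whether the new labels are better or worse – that judgment still needs a human – but it ensures that a silent model update does not go unnoticed.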


And fundamentally: Is it justifiable to use prompt-based analysis in research projects as long as we don't have answers to these questions?

I think so, under certain conditions. But we have to stop pretending that the quality issue has already been resolved. It hasn't. And as the UX research community, we should be the ones to say so openly.


I look forward to the discussion.


Frequently asked questions

What does LLM-as-a-Judge mean?

LLM-as-a-Judge is a process in which an LLM (Large Language Model) evaluates the outputs of another LLM based on defined criteria. It is currently the most common approach to automated evaluation of prompt results – but its own validity is often not tested.


Can't I just check the quality of my prompts on a random basis?

Random sampling is better than no testing at all. But it only tells you whether individual outputs look plausible – not whether the prompt works reliably over time, across different inputs and different models. For systematic quality assurance, you need a structured evaluation process.


As a UX researcher, do I need technical knowledge to evaluate prompts?

The methodological basics—reliability, validity, inter-rater reliability—are already part of the UX research toolkit. The technical implementation (eval frameworks, automated pipelines) requires additional skills or collaboration with developers. But the most important skill—evaluation design—is yours.


Does a model update really change the results of my prompt?

Yes, this is documented and common in practice. Model updates can cause subtle behavioral changes that affect the outputs of your prompt—even if you haven't changed the prompt itself. Without systematic monitoring, such changes often go unnoticed.





About the author: Tara Maria Bosenick has been working as a UX consultant since 1999, supporting companies at the interface of user research and technology. She has extensive experience with qualitative and quantitative research methods and is currently focusing intensively on the question of how AI-supported analysis can be methodologically validated in UX research.


💌 Want more? Then read on—in our newsletter.

Published four times a year. Sticks in your mind longer. https://www.uintent.com/de/newsletter



