
AI & UXR, CHAT GPT, HUMAN VS AI, LLM

How Yupp Uses Feedback to Fairly Evaluate AI Models – And What UX Professionals Can Learn From It

3 MIN · Oct 30, 2025

The most important points in a nutshell:

  • Yupp compares AI responses via crowd voting

  • Users evaluate quality, speed and clarity

  • Evaluations are statistically analysed as pair comparisons

  • The VIBE score shows which model performs better in everyday use

  • Bias is actively controlled through blind tests

  • Segmentation shows: model selection depends on the context of use

  • Practical model for UX testing methods

 

Introduction: What if feedback is the product?

Are you familiar with this? You ask ChatGPT, Claude or Gemini the same question – and get three completely different answers. Sometimes one is brilliant, sometimes totally off the mark. But who actually decides which one is ‘better’? And according to what criteria?


This is where Yupp.ai comes in – a platform that makes precisely such comparisons its core principle. It shows how users can contribute to the evaluation of AI models simply by providing feedback. And what does that have to do with UX? A lot. Many of the methods Yupp uses are familiar from our practice – only on a much larger scale.


I have been working as a UX consultant on global projects for many years. What fascinates me about Yupp is that the platform cleverly combines UX methodology and AI evaluation. And it's an excellent source of inspiration for your own testing processes.

  

How exactly does Yupp work?

Yupp is not a classic AI platform, but a ‘meta’ system: you enter a question and receive answers from several AI models. Your task: decide which answer you like better – and why.

 

The key point is that these evaluations are not simply incorporated into a star rating. Instead, Yupp uses the Bradley-Terry model – a pair comparison method that creates a consistent ranking from many individual decisions. The result: the VIBE score (‘Value Informed Benchmark Evaluation’) shows which model is the most convincing in a direct comparison.
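The mechanics of such a pair-comparison ranking can be sketched in a few lines of Python. This is my own minimal illustration of the Bradley-Terry model (fitted with the classic MM iteration), not Yupp's actual implementation; all names and the vote data are invented for the example:

```python
from collections import defaultdict

def bradley_terry(comparisons, iters=100):
    """Fit Bradley-Terry strengths from pairwise vote outcomes.

    comparisons: list of (winner, loser) tuples.
    Returns {model: strength}; a higher strength means the model is
    preferred more often in head-to-head votes.
    """
    wins = defaultdict(int)          # total wins per model
    pair_counts = defaultdict(int)   # comparisons per unordered pair
    models = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    p = {m: 1.0 for m in models}
    for _ in range(iters):
        new_p = {}
        for i in models:
            denom = 0.0
            for j in models:
                if i == j:
                    continue
                n_ij = pair_counts[frozenset((i, j))]
                if n_ij:
                    denom += n_ij / (p[i] + p[j])
            new_p[i] = wins[i] / denom if denom else p[i]
        total = sum(new_p.values())          # normalise: strengths sum to 1
        p = {m: v / total for m, v in new_p.items()}
    return p
```

Feed it a list of (winner, loser) votes and you get a strength per model; the ratio p[i] / (p[i] + p[j]) is then the estimated chance that model i beats model j in a direct comparison – which is what makes many small individual decisions add up to one consistent ranking.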

  

What criteria are used for evaluation?

Yupp does not evaluate based on ‘likability’ alone. Several dimensions play a role:

  • Answer quality: How clear, helpful and relevant is the answer?

  • Answer speed: How quickly does the model respond?

  • Cost: What does an answer cost, e.g. when using an API?

  • Confidence: Does the model make clear statements or does it remain vague?


These values are analysed together with user feedback – depending on weighting and target group.
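How such a weighting might look can be sketched quickly. This is a hypothetical illustration of weighting per target group, not Yupp's actual formula; the dimension names, weights and scores are invented:

```python
def weighted_score(metrics, weights):
    """Collapse several 0-1 quality dimensions into one score.

    metrics: per-model scores, e.g. {"quality": 0.9, "speed": 0.5, ...}
    weights: same keys; must sum to 1 so the result stays in 0-1.
    """
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(metrics[k] * weights[k] for k in weights)

# The same measurements, weighted for two different target groups:
metrics  = {"quality": 0.9, "speed": 0.5, "cost": 0.7, "confidence": 0.8}
support  = {"quality": 0.3, "speed": 0.4, "cost": 0.2, "confidence": 0.1}  # speed-sensitive
research = {"quality": 0.6, "speed": 0.1, "cost": 0.1, "confidence": 0.2}  # quality-first
```

The point of the sketch: a speed-sensitive support team and a quality-first research team can rank the very same model differently, which is why weighting by target group matters.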


Practical example:

In an experiment with factual vs. creative prompts, Claude and GPT-4 performed differently: Claude was stronger at reasoning, GPT-4 at storytelling. The evaluation was based not solely on response length or factual accuracy, but on user perception.

 

What about bias? Can the evaluations be trusted?

Good question. Yupp actively tests for bias. For example, through blind tests: the model names are hidden so that users do not know whether the response comes from GPT-4 or Claude.

 

This reduces what is known as brand bias. At the same time, systematic differences between user groups are taken into account (e.g. beginners vs. AI power users).


UX parallel:

Blinding is also an important tool in usability research to avoid perception bias. Yupp applies this principle to AI evaluation – in a scalable and data-driven way.

  

Why segmentation is so important

Not every question is the same. That's why Yupp also analyses the context of the queries:

  • Is it a factual question or a creative one?

  • Is the questioner technically savvy or more of a layperson?

This results in segment scores that show which model performs particularly well in which use cases. For us UX professionals, this is a clear lesson: blanket values are of little use. What matters is performance in the context of use.


Example:

A model may be very good on average – but fail when it comes to accessible applications or sensitive health issues. Yupp makes such differences visible.
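Computing such segment-level win rates is straightforward – here is a minimal sketch of the idea, with invented segment labels and vote data (not Yupp's actual pipeline):

```python
from collections import defaultdict

def segment_win_rates(votes):
    """Head-to-head win rates per usage segment.

    votes: list of (segment, winner, loser) tuples, e.g.
    ("factual", "claude", "gpt4").
    Returns {segment: {model: win_rate}}.
    """
    wins = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(lambda: defaultdict(int))
    for segment, winner, loser in votes:
        wins[segment][winner] += 1
        totals[segment][winner] += 1
        totals[segment][loser] += 1
    return {
        seg: {m: wins[seg][m] / totals[seg][m] for m in totals[seg]}
        for seg in totals
    }
```

A model that dominates one segment can trail in another – exactly the kind of difference a single blanket score would hide.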

 

What happens to the feedback?

This is where it gets exciting: feedback is not an add-on at Yupp – it is the product. The platform sells anonymised evaluation data to AI providers, who use it to improve their models. In return, users receive Yupp credits that can be cashed out (max. £50/month).

This means that users become real data suppliers – fairly and transparently remunerated. This is also an interesting idea for the UX industry: what if user feedback were not only collected, but also used strategically and rewarded monetarily?


FAQ: What UX teams want to know about Yupp


1. Do I need programming skills to use Yupp?

No. The barrier to entry is very low: enter your question, compare the answers, done.

2. How many models are compared?

Usually two to four per request – typically GPT-4, Claude, Gemini and Grok.

3. Can I also give feedback anonymously?

Yes. You need an account, but your ratings are not stored in a personally identifiable way.

4. Is there an API for my own tests?

Not officially yet. However, Yupp plans to offer evaluation-as-a-service for companies.

5. How does this benefit me as a UX team?

 Yupp is a source of inspiration: for evaluation logic, bias checks, segment analyses and feedback systems – all topics that UX teams deal with on a daily basis.


Conclusion: What we as a UX community can learn from Yupp

Yupp shows that user-centred feedback is possible on a large scale – without losing depth. The platform takes methods we know from UX practice and brings them to a scalable, measurable level.


UX teams would do well to take a look at Yupp in order to:

  • reflect on their own testing processes,

  • get new ideas for AI evaluations,

  • and make better decisions when choosing models.


Want to systematically test your own prompts? Or understand how other models perform? Then take a look at Yupp.ai. 


💌 Not enough? Then read on – in our newsletter. It comes four times a year. Sticks in your mind longer. To subscribe: https://www.uintent.com/newsletter


AUTHOR

Tara Bosenick

Tara has been active as a UX specialist since 1999 and has helped to establish and shape the industry in Germany on the agency side. She specialises in the development of new UX methods, the quantification of UX and the introduction of UX in companies.


At the same time, she has always been interested in developing a corporate culture in her companies that is as ‘cool’ as possible, in which fun, performance, team spirit and customer success are interlinked. She has therefore been supporting managers and companies on the path to more New Work / agility and a better employee experience for several years.


She is one of the leading voices in the UX, CX and Employee Experience industry.
