
AI & UXR, CHAT GPT, HUMAN VS AI, OPEN AI

Anecdotal Evidence or Systematic AI Research – The Current Situation and What Still Needs to Be Done


4 MIN · Mar 4, 2025

When interacting with AI systems such as ChatGPT, Gemini or Claude, there is certainly valuable anecdotal evidence that can help users communicate more efficiently. One example is the observation that many models respond better in English because they can handle complex queries more precisely in that language. This is usually because they were trained predominantly on English-language data sets, which leads to richer language comprehension. For users, this means that a query can sometimes be improved simply by switching to English, especially for very technical or scientific topics (see blog article).
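
As a rough illustration of what ‘switching to English’ means in practice, the sketch below sends the same question once in German and once in English so the answers can be compared side by side. It assumes the official OpenAI Python client and an API key in the OPENAI_API_KEY environment variable; the model name is an arbitrary example, not a recommendation.

# Illustrative sketch: ask the same question in German and in English and
# compare the answers. Assumes the OpenAI Python client ("pip install openai")
# and OPENAI_API_KEY; the model name is an arbitrary example.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # example model name, swap in whatever you use
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

german = ask("Erkläre den Unterschied zwischen Überanpassung und Unteranpassung.")
english = ask("Explain the difference between overfitting and underfitting.")
print(german, english, sep="\n---\n")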


Another example of constructive anecdotal evidence is asking follow-up questions. Since AI models sometimes overlook important aspects or over-generalise, it helps to ask the AI several specific follow-up questions about details. A dialogue in several steps can often surface all relevant information and close knowledge gaps. In practice, this form of follow-up questioning has proven effective for ensuring that the answer really does cover all the desired aspects (see blog article).
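
In API terms, such a step-by-step dialogue simply means carrying the conversation history forward and appending targeted follow-up questions. A minimal sketch, again assuming the OpenAI Python client; the questions themselves are purely illustrative:

# Step-by-step dialogue sketch: keep the full conversation history so each
# follow-up question builds on the previous answer. Assumes the OpenAI Python
# client and OPENAI_API_KEY; the model name and questions are illustrative.
from openai import OpenAI

client = OpenAI()
messages = [{"role": "user", "content": "What are the main risks of LLM hallucinations?"}]

def ask() -> str:
    reply = client.chat.completions.create(model="gpt-4o", messages=messages)
    answer = reply.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})  # keep the history
    return answer

print(ask())  # initial, possibly incomplete answer
for follow_up in [
    "Which of these risks matter most in healthcare?",
    "What concrete mitigations exist for each risk you named?",
]:
    messages.append({"role": "user", "content": follow_up})
    print(ask())  # each answer now builds on everything said before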


Likewise, tone is not insignificant in interactions with AI. Studies on human-AI communication show that a polite and structured form of address often leads to better-structured answers – possibly because the model interprets politeness as ‘serious interest’ and thus focuses more specifically on the question asked. Anecdotes from everyday life also suggest that respectful language improves the flow of conversation and reduces misunderstandings (see blog article).


But while these positive ‘application anecdotes’ can help users improve the quality of AI results, there are also numerous bizarre and absurd failures that raise questions about the quality and reliability of AI. Their humorous aspect means such failures often attract a lot of attention and may go viral – for example, the question of how many ‘r’s are in the word ‘strawberry’ and ChatGPT's inability to answer it correctly (see blog article). Unfortunately, however, they obscure the need to develop systematic quality assessment standards for AI – a topic that is still in its early stages in scientific research.


State of research on systematic quality assessment 

In scientific research, there are initial approaches to developing systematic methods for assessing the quality of AI results. The Multidimensional Quality Metrics (MQM) approach offers a way to evaluate translations or generated text based on specific categories such as coherence, grammar and relevance. Projects such as the AIGIQA-20K database, available at CVPR 2024 Open Access Repository, take this approach to gain more systematic insights into the quality of AI-generated images (CVF Open Access).  
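
To make the idea behind MQM-style scoring concrete, the sketch below aggregates annotated errors per severity into a single score. The categories, penalty weights and normalisation are illustrative examples, not the official MQM specification:

# Illustrative MQM-style scoring: annotated errors are penalised by severity
# and normalised by text length. The weights below are example values only,
# not the official MQM specification.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}

def mqm_style_score(errors: list[tuple[str, str]], word_count: int) -> float:
    """errors: (category, severity) annotations, e.g. ("grammar", "minor")."""
    penalty = sum(SEVERITY_WEIGHTS[severity] for _, severity in errors)
    # Scale the penalty per 100 words and subtract it from a perfect 100.
    return max(0.0, 100.0 - penalty * 100.0 / max(word_count, 1))

# A 250-word text with one minor grammar and one major coherence error:
print(mqm_style_score([("grammar", "minor"), ("coherence", "major")], 250))  # 97.6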


Frameworks such as TruEra offer another example of systematic evaluation: in addition to technical quality standards, they also take into account ethical and societal criteria such as fairness and transparency. This combined perspective helps to assess the quality and social acceptability of AI outputs more comprehensively, as described at TruEra.


Regulatory frameworks, such as the one emerging with the EU AI Act, also indicate an increasing recognition of the need to make AI systems reliable and transparent. These regulations are intended to ensure, particularly in safety-critical areas such as healthcare, that AI outputs are accurate and verifiable and that users can trust the systems.


 

Challenges and opportunities for better AI results 

However, the road to truly comprehensive quality assessment remains complex. Context-dependent adaptations and continuous maintenance are indispensable when evaluating AI outputs, and a single standardised metric often does not do justice to the diversity of application scenarios. To achieve a realistic evaluation, researchers are therefore experimenting with hybrid models that combine human oversight – as in ‘human-in-the-loop’ approaches – with machine evaluations. Particularly in sensitive areas such as medicine, such a combination has been shown to deliver more reliable and trustworthy results.
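
One common way of combining the two is triage: an automatic metric scores every output, and anything below a threshold is routed to a human reviewer rather than accepted automatically. A minimal sketch – the score range and threshold are chosen purely for illustration:

# Minimal human-in-the-loop triage sketch: outputs whose automatic quality
# score falls below a threshold are flagged for human review. Scores and
# threshold are illustrative, not calibrated values.
from dataclasses import dataclass

@dataclass
class Evaluation:
    text: str
    machine_score: float  # e.g. 0.0-1.0 from an automatic metric
    needs_human_review: bool

def triage(text: str, machine_score: float, threshold: float = 0.8) -> Evaluation:
    return Evaluation(text, machine_score, machine_score < threshold)

for text, score in [("Suggested diagnosis A", 0.93), ("Suggested diagnosis B", 0.41)]:
    result = triage(text, score)
    print(result.text, "-> human review" if result.needs_human_review else "-> auto-accept")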

 

Another example of the importance of context-specific metrics is the BIG-bench project, which tests language models for creative abilities and logical reasoning in addition to pure accuracy (ProjectPro). Such sophisticated benchmarks help to ensure that AI not only answers correctly but also ‘intelligently’ – an important distinction for mastering complex tasks.

 

Limitations and potential of AI in everyday life: humorous anecdotes and their significance 

Besides constructive applications, there are also numerous absurd misinterpretations and misunderstandings that make AI results seem unintentionally humorous. Here are a few examples of such anecdotes (found, for example, at https://dataconomy.com/2024/06/03/funny-ai-fails-moments/):

 

  • Edible stones: An AI suggested eating a small stone every day to get important minerals. This recommendation caused a lot of amusement because it is completely misleading and dangerous to health.

  • Pizza glue: An AI recommended adding non-toxic glue to pizza to prevent the cheese from sliding off. While it may be a creative solution, it shows a lack of understanding of eating habits and safety.

  • Hamsters recognised as people: In a facial recognition system, a hamster was classified as a human face, calling into question the reliability of the technology.

  • Eight-day week for more fitness: An AI system suggested training eight days a week to stay fit. A ‘timeless’ recommendation that shows that even basic concepts like time are difficult for AI models to grasp.

  • Balenciaga Pope: An AI-generated image showing the Pope wearing a Balenciaga jacket went viral because it looked both realistic and completely surreal at first glance. Many people thought the picture was real until it became clear that it was AI-generated.


Such failures highlight the existing limitations of AI systems and show that the models often have difficulty interpreting unusual contexts correctly.


Social and ethical implications 

Such ‘humorous’ failures can influence public perception of and trust in AI systems. Especially in safety-critical areas such as medicine or law, mistakes could have serious consequences. The necessity of ‘explainable AI’ is therefore increasingly being emphasised in order to make decisions traceable and understandable. A transparent explanation of the decision-making process can strengthen trust in AI and at the same time help to avoid misunderstandings and misinterpretations.

The responsibility for such standards also lies with the developers and providers. By introducing clear quality management processes and showing users how AI decisions are made, they help to make the use of AI safer and more comprehensible.


Additional perspectives for quality assessment 

Apart from systematic quality standards and ethical guidelines, structured data quality management is crucial. The quality of the data used directly influences the accuracy and relevance of AI outputs and helps to minimise biases and errors. Regulatory measures such as the EU AI Act give initial indications of the importance of a reliable data foundation (The legal texts | EU law on artificial intelligence).
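
What structured data quality management can look like at its simplest: automated checks that remove duplicates, empty records and encoding debris before data ever reaches a model. The filters below are deliberately simplistic examples; real pipelines add bias audits, source checks and schema validation on top.

# Deliberately simple data-quality filters: deduplicate, drop empty or
# trivially short records, and reject obvious encoding debris.
def clean(records: list[str]) -> list[str]:
    seen: set[str] = set()
    cleaned: list[str] = []
    for record in records:
        text = record.strip()
        if len(text) < 5:        # drop empty / trivially short records
            continue
        if "\ufffd" in text:     # drop records containing replacement characters
            continue
        if text in seen:         # drop exact duplicates
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned

print(clean(["Valid example sentence.", "Valid example sentence.", "", "Bad \ufffd text"]))
# -> ['Valid example sentence.']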


Another aspect is training and raising awareness among users. A basic understanding of AI and how it works helps to avoid misunderstandings and to evaluate the technology more realistically. User training programmes have a key role to play in showing people how to see the anecdotal failures for what they are: curious exceptions that say nothing about the general quality of the AI.


Conclusion and outlook 

The anecdotes about tips for using AI and AI errors are just that: anecdotes. We need new evaluation methods (such as human-in-the-loop and context-specific quality metrics) to better align AI models with real-world needs and use them responsibly.



AUTHOR

Tara Bosenick

Tara has been active as a UX specialist since 1999 and has helped to establish and shape the industry in Germany on the agency side. She specialises in the development of new UX methods, the quantification of UX and the introduction of UX in companies.


At the same time, she has always been interested in developing a corporate culture in her companies that is as ‘cool’ as possible, in which fun, performance, team spirit and customer success are interlinked. She has therefore been supporting managers and companies on the path to more New Work / agility and a better employee experience for several years.


She is one of the leading voices in the UX, CX and Employee Experience industry.
