
AI & UXR, CHAT GPT, HUMAN VS AI, OPEN AI

Anecdotal Evidence or Systematic AI Research – The Current Situation and What Still Needs to Be Done


4 MIN · Mar 4, 2025

When interacting with AI systems such as ChatGPT, Gemini or Claude, there is certainly valuable anecdotal evidence that can help users communicate more efficiently. One example is the observation that many models respond better in English because they can handle complex queries more precisely in that language. This is usually because they were trained predominantly on English-language data sets, which leads to richer language comprehension. For users, this means that a query can sometimes be improved simply by switching to English, especially for very technical or scientific topics (see blog article).
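
As a rough illustration of what ‘switching to English’ means in practice, the sketch below sends the same question once in German and once in English so the answers can be compared side by side. It assumes the official OpenAI Python client and an API key in the OPENAI_API_KEY environment variable; the model name is an arbitrary example, not a recommendation.

# Illustrative sketch: ask the same question in German and in English and
# compare the answers. Assumes the OpenAI Python client ("pip install openai")
# and OPENAI_API_KEY; the model name is an arbitrary example.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # example model name, swap in whatever you use
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

german = ask("Erkläre den Unterschied zwischen Überanpassung und Unteranpassung.")
english = ask("Explain the difference between overfitting and underfitting.")
print(german, english, sep="\n---\n")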


Another example of constructive anecdotal evidence is asking follow-up questions. Since AI models sometimes overlook important aspects or over-generalise, it helps to ask the AI several specific follow-up questions about details. A dialogue in several steps can often surface all relevant information and close knowledge gaps. In practice, this form of follow-up questioning has proven effective for ensuring that the answer really does cover all the desired aspects (see blog article).
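
In API terms, such a step-by-step dialogue simply means carrying the conversation history forward and appending targeted follow-up questions. A minimal sketch, again assuming the OpenAI Python client; the questions themselves are purely illustrative:

# Step-by-step dialogue sketch: keep the full conversation history so each
# follow-up question builds on the previous answer. Assumes the OpenAI Python
# client and OPENAI_API_KEY; the model name and questions are illustrative.
from openai import OpenAI

client = OpenAI()
messages = [{"role": "user", "content": "What are the main risks of LLM hallucinations?"}]

def ask() -> str:
    reply = client.chat.completions.create(model="gpt-4o", messages=messages)
    answer = reply.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})  # keep the history
    return answer

print(ask())  # initial, possibly incomplete answer
for follow_up in [
    "Which of these risks matter most in healthcare?",
    "What concrete mitigations exist for each risk you named?",
]:
    messages.append({"role": "user", "content": follow_up})
    print(ask())  # each answer now builds on everything said before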


Likewise, tone is not insignificant in interactions with AI. Studies on human-AI communication show that a polite and structured form of address often leads to better-structured answers – possibly because the model interprets politeness as ‘serious interest’ and thus focuses more specifically on the question asked. Anecdotes from everyday life also suggest that respectful language improves the flow of conversation and reduces misunderstandings (see blog article).


But while these positive ‘application anecdotes’ can help users improve the quality of AI results, there are also numerous bizarre and absurd failures that raise questions about the quality and reliability of AI. Their humorous aspect means such failures often attract a lot of attention and may go viral – for example, the question of how many ‘r’s are in the word ‘strawberry’ and ChatGPT's inability to answer it correctly (see blog article). Unfortunately, however, they obscure the need to develop systematic quality assessment standards for AI – a topic that is still in its early stages in scientific research.


State of research on systematic quality assessment 

In scientific research, there are initial approaches to developing systematic methods for assessing the quality of AI results. The Multidimensional Quality Metrics (MQM) approach offers a way to evaluate translations or generated text based on specific categories such as coherence, grammar and relevance. Projects such as the AIGIQA-20K database, available at CVPR 2024 Open Access Repository, take this approach to gain more systematic insights into the quality of AI-generated images (CVF Open Access).  
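
To make the idea behind MQM-style scoring concrete, the sketch below aggregates annotated errors per severity into a single score. The categories, penalty weights and normalisation are illustrative examples, not the official MQM specification:

# Illustrative MQM-style scoring: annotated errors are penalised by severity
# and normalised by text length. The weights below are example values only,
# not the official MQM specification.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}

def mqm_style_score(errors: list[tuple[str, str]], word_count: int) -> float:
    """errors: (category, severity) annotations, e.g. ("grammar", "minor")."""
    penalty = sum(SEVERITY_WEIGHTS[severity] for _, severity in errors)
    # Scale the penalty per 100 words and subtract it from a perfect 100.
    return max(0.0, 100.0 - penalty * 100.0 / max(word_count, 1))

# A 250-word text with one minor grammar and one major coherence error:
print(mqm_style_score([("grammar", "minor"), ("coherence", "major")], 250))  # 97.6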


Frameworks such as TruEra offer another example of systematic evaluation: in addition to technical quality standards, they also take into account ethical and societal criteria such as fairness and transparency. This combined perspective helps to assess the quality and social acceptability of AI outputs more comprehensively, as described at TruEra.


Regulatory frameworks, such as the one emerging with the EU AI Act, also indicate an increasing recognition of the need to make AI systems reliable and transparent. These regulations are intended to ensure, particularly in safety-critical areas such as healthcare, that AI outputs are accurate and verifiable and that users can trust the systems.


 

Challenges and opportunities for better AI results 

However, the road to truly comprehensive quality assessment remains complex. Context-dependent adaptations and continuous maintenance are indispensable when evaluating AI outputs, and a single standardised metric often does not do justice to the diversity of application scenarios. To achieve a realistic evaluation, researchers are therefore experimenting with hybrid models that combine human oversight – as in ‘human-in-the-loop’ approaches – with machine evaluations. Particularly in sensitive areas such as medicine, such a combination has been shown to deliver more reliable and trustworthy results.
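
One common way of combining the two is triage: an automatic metric scores every output, and anything below a threshold is routed to a human reviewer rather than accepted automatically. A minimal sketch – the score range and threshold are chosen purely for illustration:

# Minimal human-in-the-loop triage sketch: outputs whose automatic quality
# score falls below a threshold are flagged for human review. Scores and
# threshold are illustrative, not calibrated values.
from dataclasses import dataclass

@dataclass
class Evaluation:
    text: str
    machine_score: float  # e.g. 0.0-1.0 from an automatic metric
    needs_human_review: bool

def triage(text: str, machine_score: float, threshold: float = 0.8) -> Evaluation:
    return Evaluation(text, machine_score, machine_score < threshold)

for text, score in [("Suggested diagnosis A", 0.93), ("Suggested diagnosis B", 0.41)]:
    result = triage(text, score)
    print(result.text, "-> human review" if result.needs_human_review else "-> auto-accept")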

 

Another example of the importance of context-specific metrics is the BIG-bench project, which tests language models for creative abilities and logical reasoning in addition to pure accuracy (ProjectPro). Such sophisticated benchmarks help to ensure that AI not only answers correctly but also ‘intelligently’ – an important distinction for mastering complex tasks.

 

Limitations and potential of AI in everyday life: humorous anecdotes and their significance 

Besides constructive applications, there are also numerous absurd misinterpretations and misunderstandings that make AI results seem unintentionally humorous. Here are a few examples of such anecdotes (found, for example, at https://dataconomy.com/2024/06/03/funny-ai-fails-moments/):

 

  • Edible stones: An AI suggested eating a small stone every day to get important minerals. This recommendation caused a lot of amusement because it is completely misleading and dangerous to health.

  • Pizza glue: An AI recommended adding non-toxic glue to pizza to prevent the cheese from sliding off. While it may be a creative solution, it shows a lack of understanding of eating habits and safety.

  • Hamsters recognised as people: In a facial recognition system, a hamster was classified as a human face, calling into question the reliability of the technology.

  • Eight-day week for more fitness: An AI system suggested training eight days a week to stay fit. A ‘timeless’ recommendation that shows that even basic concepts like time are difficult for AI models to grasp.

  • Balenciaga Pope: An AI-generated image showing the Pope wearing a Balenciaga jacket went viral because it looked both realistic and completely surreal at first glance. Many people thought the picture was real until it became clear that it was AI-generated.


Such failures highlight the existing limitations of AI systems and show that the models often have difficulty interpreting unusual contexts correctly.


Social and ethical implications 

Such ‘humorous’ failures can influence public perception of and trust in AI systems. Especially in safety-critical areas such as medicine or law, mistakes could have serious consequences. The necessity of ‘explainable AI’ is therefore increasingly being emphasised in order to make decisions traceable and understandable. A transparent explanation of the decision-making process can strengthen trust in AI and at the same time help to avoid misunderstandings and misinterpretations.

The responsibility for such standards also lies with the developers and providers. By introducing clear quality management processes and showing users how AI decisions are made, they help to make the use of AI safer and more comprehensible.


Additional perspectives for quality assessment 

Apart from systematic quality standards and ethical guidelines, structured data quality management is crucial. The quality of the data used directly influences the accuracy and relevance of AI outputs and helps to minimise biases and errors. Regulatory measures such as the EU AI Act give initial indications of the importance of a reliable data foundation (The legal texts | EU law on artificial intelligence).
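
What structured data quality management can look like at its simplest: automated checks that remove duplicates, empty records and encoding debris before data ever reaches a model. The filters below are deliberately simplistic examples; real pipelines add bias audits, source checks and schema validation on top.

# Deliberately simple data-quality filters: deduplicate, drop empty or
# trivially short records, and reject obvious encoding debris.
def clean(records: list[str]) -> list[str]:
    seen: set[str] = set()
    cleaned: list[str] = []
    for record in records:
        text = record.strip()
        if len(text) < 5:        # drop empty / trivially short records
            continue
        if "\ufffd" in text:     # drop records containing replacement characters
            continue
        if text in seen:         # drop exact duplicates
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned

print(clean(["Valid example sentence.", "Valid example sentence.", "", "Bad \ufffd text"]))
# -> ['Valid example sentence.']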


Another aspect is training and raising awareness among users. A basic understanding of AI and how it works helps to avoid misunderstandings and to evaluate the technology more realistically. User training programmes have a key role to play in showing people how to see the anecdotal failures for what they are: curious exceptions that say nothing about the general quality of the AI.


Conclusion and outlook 

The anecdotes about tips for using AI and AI errors are just that: anecdotes. We need new evaluation methods (such as human-in-the-loop and context-specific quality metrics) to better align AI models with real-world needs and use them responsibly.



AUTHOR

Tara Bosenick

Tara has been active as a UX specialist since 1999 and has helped to establish and shape the industry in Germany on the agency side. She specialises in the development of new UX methods, the quantification of UX and the introduction of UX in companies.


At the same time, she has always been interested in developing a corporate culture in her companies that is as ‘cool’ as possible, in which fun, performance, team spirit and customer success are interlinked. She has therefore been supporting managers and companies on the path to more New Work / agility and a better employee experience for several years.


She is one of the leading voices in the UX, CX and Employee Experience industry.
