
AI & UXR, CHAT GPT, HUMAN VS AI, OPEN AI
Anecdotal Evidence or Systematic AI Research – The Current Situation and What Still Needs to Be Done
4 MIN
Mar 4, 2025
When interacting with AI systems such as ChatGPT, Gemini or Claude, there is certainly valuable anecdotal evidence that can help users communicate more efficiently. One example is the observation that many models respond better in English because they can handle complex queries more precisely in that language. This is often because they were trained predominantly on English-language data sets, which leads to richer language comprehension. For users, this means that a query can sometimes be improved simply by switching to English, especially for very technical or scientific topics (see blog article).
Another example of constructive anecdotal evidence is asking follow-up questions. Since AI models sometimes overlook important aspects or over-generalise, it helps to ask the AI several specific follow-up questions about the details. A dialogue in several steps can often surface all relevant information and close knowledge gaps. This kind of follow-up questioning has proven effective in practice for ensuring that the answer really does cover all the desired aspects (see blog article).
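To make the pattern concrete, here is a minimal sketch of such a multi-step dialogue using the OpenAI Python SDK; the model name, the example questions and the helper function are assumptions for illustration, not a prescribed workflow.

```python
# Minimal sketch of a multi-step dialogue: each follow-up question is appended
# to the conversation history so the model can fill gaps from earlier answers.
# Requires the "openai" package and an API key; the model name is an assumption.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(messages, question):
    """Append a user question, fetch the model's answer and keep it in the history."""
    messages.append({"role": "user", "content": question})
    reply = client.chat.completions.create(model="gpt-4o", messages=messages)
    answer = reply.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})
    return answer

history = []
print(ask(history, "Summarise the main UX risks of voice assistants."))
# Follow-up questions reuse the same history, so the model can close its own gaps
print(ask(history, "Which of these risks apply specifically to older users?"))
print(ask(history, "What concrete design measures reduce the most severe risk?"))
```

Because every follow-up is sent together with the previous turns, the model can refer back to its earlier answer instead of starting from scratch, which is exactly what makes the multi-step dialogue effective.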
Tone also matters in interactions with AI. Studies on human-AI communication show that a polite and structured form of address often leads to better-structured answers – possibly because the model interprets politeness as ‘serious interest’ and therefore focuses more specifically on the question asked. Anecdotes from everyday use likewise suggest that respectful language improves the flow of conversation and reduces misunderstandings (see blog article).
But while these positive ‘application anecdotes’ can help users improve the quality of AI results, there are also numerous bizarre and absurd failures that raise questions about the quality and reliability of AI. Their humorous aspect means that such failures often attract a lot of attention and may go viral – for example, the question of how many ‘r’s there are in the word ‘strawberry’ and ChatGPT's inability to answer it correctly (see blog article). Unfortunately, however, these stories obscure the need to develop systematic quality assessment standards for AI – a topic that is still in its early stages in scientific research.
State of research on systematic quality assessment
In scientific research, there are initial approaches to developing systematic methods for assessing the quality of AI results. The Multidimensional Quality Metrics (MQM) approach offers a way to evaluate translations or generated text based on specific categories such as coherence, grammar and relevance. Projects such as the AIGIQA-20K database, available in the CVPR 2024 Open Access Repository, pursue the same goal of gaining more systematic insights into the quality of AI-generated images (CVF Open Access).
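As a rough illustration of what an MQM-style evaluation could look like in practice, the sketch below weights annotated errors by severity and normalises by text length; the category names, severity weights and normalisation are simplified assumptions for illustration, not the official MQM specification.

```python
# Sketch of an MQM-style quality score: annotated errors are weighted by
# severity and normalised by text length. Categories and weights here are
# illustrative assumptions, not the official MQM specification.
from dataclasses import dataclass

SEVERITY_WEIGHTS = {"minor": 1.0, "major": 5.0, "critical": 10.0}

@dataclass
class ErrorAnnotation:
    category: str   # e.g. "coherence", "grammar", "relevance"
    severity: str   # "minor", "major" or "critical"

def mqm_style_score(word_count: int, errors: list[ErrorAnnotation]) -> float:
    """Return a score between 0 and 1, where 1 means no annotated errors."""
    penalty = sum(SEVERITY_WEIGHTS[e.severity] for e in errors)
    return max(0.0, 1.0 - penalty / max(word_count, 1))

# Example: a 120-word answer with one major coherence error and two minor grammar errors
errors = [ErrorAnnotation("coherence", "major"),
          ErrorAnnotation("grammar", "minor"),
          ErrorAnnotation("grammar", "minor")]
print(mqm_style_score(120, errors))  # -> roughly 0.94
```

The key idea is that quality is not a single gut feeling but an aggregation of explicitly annotated errors, which makes evaluations comparable across annotators and systems.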
Another example of systematic evaluation approaches are frameworks such as TruEra, which, in addition to technical quality standards, also take into account ethical and societal criteria such as fairness and transparency. This combined perspective helps to assess the quality and social acceptability of AI outputs more comprehensively, as described on the TruEra website.
Regulatory frameworks, such as the one emerging with the EU AI Act, also indicate an increasing recognition of the need to make AI systems reliable and transparent. These regulations are intended to ensure, particularly in safety-critical areas such as healthcare, that AI outputs are accurate and verifiable and that users can trust the systems.
Challenges and opportunities for better AI results
However, the road to a truly comprehensive quality assessment remains complex. Context-dependent adaptations and continuous maintenance are indispensable when evaluating AI outputs. A simple, standardised metric often does not do justice to the diversity of application scenarios. To achieve a realistic evaluation, researchers are therefore experimenting with hybrid models that combine human control – as in ‘human-in-the-loop’ approaches – and machine evaluations. Particularly in sensitive areas such as medicine, it has been shown that such a combination delivers more reliable and trustworthy results.
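A minimal sketch of such a hybrid setup is shown below: outputs are scored automatically and routed to a human reviewer whenever the score is low or the domain is safety-critical. The scoring function, threshold and domain list are placeholder assumptions for illustration, not a reference implementation.

```python
# Sketch of a "human-in-the-loop" evaluation pipeline: automatic scoring first,
# human review only where the machine score is low or the domain is sensitive.
# The score function, threshold and domain list are illustrative assumptions.

SENSITIVE_DOMAINS = {"medicine", "law"}

def automatic_score(output_text: str) -> float:
    # Placeholder for any machine metric (e.g. an MQM-style score as above).
    return 0.0 if not output_text.strip() else min(1.0, len(output_text) / 500)

def needs_human_review(output_text: str, domain: str, threshold: float = 0.8) -> bool:
    """Route an output to a human reviewer if the automatic score is low
    or the application domain is safety-critical."""
    return domain in SENSITIVE_DOMAINS or automatic_score(output_text) < threshold

# Example usage
answer = "Take one aspirin every four hours for mild headaches."
if needs_human_review(answer, domain="medicine"):
    print("Escalate to a human expert before showing this answer.")
```

The design choice here is deliberately asymmetric: machine metrics handle the bulk of routine outputs cheaply, while human judgement is reserved for the cases where errors would be most costly.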
Another example of the importance of context-specific metrics is the BIG-bench project, which tests language models not only for pure accuracy but also for creative abilities and logical reasoning (ProjectPro). Such sophisticated benchmarks help to ensure that AI not only answers correctly but also ‘intelligently’ – an important distinction for mastering complex tasks.
Limitations and potential of AI in everyday life: humorous anecdotes and their significance
Besides constructive applications, there are also numerous absurd misinterpretations and misunderstandings that often make AI results seem humorous. Here are a few examples of such anecdotes (found, for example, on https://dataconomy.com/2024/06/03/funny-ai-fails-moments/):
Edible stones: An AI suggested eating a small stone every day to get important minerals. This recommendation caused a lot of amusement because it is completely misleading and dangerous to health.
Pizza glue: An AI recommended adding non-toxic glue to pizza to prevent the cheese from sliding off. While it may be a creative solution, it shows a lack of understanding of eating habits and safety.
Hamsters recognised as people: In a facial recognition system, a hamster was classified as a human face, calling into question the reliability of the technology.
Eight-day week for more fitness: An AI system suggested training eight days a week to stay fit. A ‘timeless’ recommendation that shows that even basic concepts like time are difficult for AI models to grasp.
Balenciaga Pope: An AI-generated image showing the Pope wearing a Balenciaga jacket went viral because it looked both realistic and completely surreal at first glance. Many people thought the picture was real until it became clear that it was AI-generated.
Such failures highlight the existing limitations of AI systems and show that the models often have difficulties correctly interpreting unusual contexts.
Social and ethical implications
Such ‘humorous’ failures can influence public perception and trust in AI systems. Especially in safety-critical areas such as medicine or law, mistakes could have serious consequences. Therefore, the necessity of ‘explainable AI’ is increasingly being emphasised in order to make decisions comprehensible and understandable. A transparent explanation of the decision-making process can strengthen trust in AI and at the same time help to avoid misunderstandings or misinterpretations.
The responsibility for such standards also lies with the developers and providers. By introducing clear quality management processes and showing users how AI decisions are made, they help to make the use of AI safer and more comprehensible.
Additional perspectives for quality assessment
Apart from systematic quality standards and ethical guidelines, structured data quality management is crucial. The quality of the data used directly influences the accuracy and relevance of AI outputs and ensures that biases and errors are minimised. Regulatory measures such as the EU AI Act provide initial indications of the importance of a reliable data foundation (The legal texts | EU law on artificial intelligence).
Another aspect is training and raising awareness among users. A basic understanding of AI and how it works helps to avoid misunderstandings and to evaluate the technology more realistically. User training programmes have a key role to play in showing people how to see the anecdotal failures for what they are: curious exceptions that say nothing about the general quality of the AI.
Conclusion and outlook
The anecdotes about tips for using AI and AI errors are just that: anecdotes. We need new evaluation methods (such as human-in-the-loop and context-specific quality metrics) to better align AI models with real-world needs and use them responsibly.
AUTHOR
Tara Bosenick
Tara has been active as a UX specialist since 1999 and has helped to establish and shape the industry in Germany on the agency side. She specialises in the development of new UX methods, the quantification of UX and the introduction of UX in companies.
At the same time, she has always been interested in developing a corporate culture in her companies that is as ‘cool’ as possible, in which fun, performance, team spirit and customer success are interlinked. She has therefore been supporting managers and companies on the path to more New Work / agility and a better employee experience for several years.
She is one of the leading voices in the UX, CX and Employee Experience industry.
