AI & UXR, TOKEN, LLM
Why AI Sometimes Can’t Count to 3 – And What That Has to Do With Tokens
3 MIN
Sep 25, 2025
When AI miscounts
‘How many rs are there in strawberry?’ – a simple question, right?
Not for language models. For a long time, the standard AI answer was ‘2’. Anyone who counts for themselves quickly realises this is wrong: strawberry clearly contains three rs.
It's a mistake that seems so absurdly simple that it makes you wonder: how can a highly developed language model like ChatGPT fail at this?
The answer takes us to the heart of language model architecture – more specifically, to the world of tokenisation. And that's more fascinating than it seems at first glance.
What is actually happening here?
Language models such as ChatGPT do not really count. Nor do they analyse letters, at least not in the way we would.
Instead, they break down text into so-called tokens – smaller units with which the model was trained. Tokens can be an entire word, part of a word or even just a syllable.
And here's the crux of the matter: ‘r’ is not a token.
The model does not ‘see’ the letter individually, but embedded in larger text segments. It only recognises the ‘r’ if explicitly prompted to do so – and even then, it may get it wrong depending on the prompt and context.
This is because language models do not work deterministically, but probabilistically: they guess what is probably meant – and not necessarily what would be mathematically correct.
Tokenisation using the example of ‘strawberry’
The word strawberry, for example, is broken down into exactly two tokens by GPT-4o:
["straw", "berry"]
This means that the model recognises strawberry as two typical word components. And the ‘r’? It is contained in both tokens – but never in isolation. The language model does not count letters, but probability-based clusters of meaning.
So anyone who asks how many rs there are in strawberry is putting a counting question to a semantic probability model. No wonder it often got it wrong in the past.
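If you want to reproduce this split programmatically, here is a minimal sketch using OpenAI’s open-source tiktoken library (o200k_base is the encoding used by GPT-4o; the exact split can vary between encodings and versions):

```python
import tiktoken

# o200k_base is the encoding GPT-4o uses
enc = tiktoken.get_encoding("o200k_base")

word = "strawberry"
token_ids = enc.encode(word)
tokens = [enc.decode_single_token_bytes(t).decode("utf-8") for t in token_ids]

print(tokens)                     # expected: ['straw', 'berry']
print(len(tokens), "tokens")

# Deterministic code counts letters trivially --
# exactly the operation the model never performs:
print(word.count("r"), "occurrences of 'r'")   # 3
```

The model itself only ever sees the integer token IDs, never the characters behind them – which is why the isolated ‘r’ simply does not exist at its level of representation.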
Even more exciting: German words
German language, difficult tokens: our beloved compound words are a real stress test for tokenisers. But surprisingly, GPT-4o doesn't do too badly here:
| Word | Tokens | Count |
| --- | --- | --- |
| Herausforderung (challenge) | ["Hera", "us", "ford", "er", "ung"] | 5 |
| Krankenhausaufenthalt (hospital stay) | ["Kranken", "haus", "auf", "ent", "halt"] | 5 |
| Datenschutzgrundverordnung (General Data Protection Regulation) | ["Datenschutz", "grund", "ver", "ord", "nung"] | 5 |
| Arbeitszeiterfassungspflicht (obligation to record working hours) | ["Arbeits", "zeit", "er", "fass", "ungs", "pflicht"] | 6 |
| Selbstverständlichkeit (matter of course) | ["Selbst", "ver", "ständ", "lich", "keit"] | 5 |
This shows that the tokeniser recognises many meaningful units – such as ‘Selbst’, ‘haus’ or the endings ‘-ung’ and ‘-keit’ – and splits compounds in a semantically intelligent way.
But here, too, the same applies: no model counts letters. It recognises, processes and combines tokens. It can only count correctly if it is additionally trained for this or guided with an example in the prompt.
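The same inspection works for the German compounds from the table. A short sketch under the same assumption (tiktoken with the o200k_base encoding – your splits may differ slightly from the table depending on the tokeniser version):

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

words = [
    "Herausforderung",
    "Krankenhausaufenthalt",
    "Datenschutzgrundverordnung",
    "Arbeitszeiterfassungspflicht",
    "Selbstverständlichkeit",
]

for word in words:
    ids = enc.encode(word)
    # errors="replace" guards against token boundaries that
    # cut through a multi-byte character such as 'ä'
    parts = [enc.decode_single_token_bytes(t).decode("utf-8", errors="replace")
             for t in ids]
    print(f"{word}: {parts} -> {len(ids)} tokens")
```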
Why this is important (also for UX & prompting)
These small counting errors tell a big story – about the nature of language models.
Language models do not calculate; they rank language probabilistically. They can be brilliant with nuance and leaps of meaning, but completely off the mark on simple structural questions – counting letters, alphabetical sorting or mathematical sequences.
When it comes to prompting, this means:
If precision is important (e.g. for counting, formatting, extraction) → formulate tasks very clearly
If hallucinations are to be avoided → provide examples
If UX researchers work with AI → keep the behaviour of the tokeniser in mind
Because many supposed ‘errors’ are actually consequences of the architecture. And once you understand that, you can write much better prompts – and get better results.
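One pattern that helps in practice is to make the structural work explicit – for instance, asking the model to spell the word out before counting. A hypothetical sketch with the openai Python client (the model name and prompt wording are illustrative, not a guaranteed recipe):

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Make the structural task explicit: spell first, then count.
prompt = (
    "Spell the word 'strawberry' letter by letter, "
    "then count how many times the letter 'r' appears. "
    "Answer with the spelled-out letters and the final number."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)
```

Spelling the word out first turns the hidden character-level task into visible tokens, which is why this style of prompt tends to be noticeably more reliable.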
Conclusion: Tokens are the new semantics – or the new tripwire
The error with the ‘r’ in strawberry is not a trivial bug – it is an invitation to better understand language models.
Anyone who works with AI should be aware that:
AI does not understand letters – it understands tokens.
AI does not count – it estimates probabilities.
AI is not stupid – it is just trained differently.
Those who know this are less likely to stumble over simple tasks – and get more out of complex prompts.
Bonus: Try it yourself
🔧 Tool tip
If you want to try for yourself how words are broken down into tokens, you can use this OpenAI tool, for example:
👉 https://platform.openai.com/tokenizer
🧠 Prompt tip for counting
‘Please count exactly how many times the letter “r” appears in the following word: strawberry. Just give me the number.’
📣 Participatory question
How many s are there in ‘Mississippi’?
💌 Not enough? Then read on – in our newsletter. It comes four times a year. Sticks in your mind longer. To subscribe: https://www.uintent.com/newsletter
AUTHOR
Tara Bosenick
Tara has been active as a UX specialist since 1999 and has helped to establish and shape the industry in Germany on the agency side. She specialises in the development of new UX methods, the quantification of UX and the introduction of UX in companies.
At the same time, she has always been interested in developing a corporate culture in her companies that is as ‘cool’ as possible, in which fun, performance, team spirit and customer success are interlinked. She has therefore been supporting managers and companies on the path to more New Work / agility and a better employee experience for several years.
She is one of the leading voices in the UX, CX and Employee Experience industry.