Tokens are a big reason today's generative AI falls short

Generative AI models don't process text the same way humans do. Understanding their "token"-based internal environments may help explain some of their strange behaviors — and stubborn limitations.

Most models, from small on-device ones like Gemma to OpenAI's industry-leading GPT-4o, are built on an architecture known as the transformer. Due to the way transformers conjure up associations between text and other types of data, they can't take in or output raw text — at least not without a massive amount of compute.

So, for reasons both pragmatic and technical, today's transformer models work with text that's been broken down into smaller, bite-sized pieces called tokens — a process known as tokenization.

Tokens can be words, like "fantastic." Or they can be syllables, like "fan," "tas" and "tic." Depending on the tokenizer — the model that does the tokenizing — they might even be individual characters in words (e.g., "f," "a," "n," "t," "a," "s," "t," "i," "c").
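The three granularities can be sketched in a few lines of Python. This is purely illustrative (the syllable split is done by hand, not by a real tokenizer); it just shows that the same text can be represented by very different numbers of tokens.

```python
# Illustrative sketch, not a real tokenizer: the same word split at three
# different granularities, depending on the tokenizer's design.
word = "fantastic"

as_word = [word]                      # whole-word token
as_syllables = ["fan", "tas", "tic"]  # subword/syllable tokens (hand-split here)
as_characters = list(word)            # character-level tokens

for scheme, tokens in [("word", as_word),
                       ("syllable", as_syllables),
                       ("character", as_characters)]:
    # every scheme reconstructs the same text; only the token count differs
    assert "".join(tokens) == word
    print(f"{scheme}: {tokens} -> {len(tokens)} token(s)")
```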

Using this method, transformers can take in more information (in the semantic sense) before they reach an upper limit known as the context window. But tokenization can also introduce biases.

Some tokens have odd spacing, which can derail a transformer. A tokenizer might encode "once upon a time" as "once," "upon," "a," "time," for example, while encoding "once upon a " (which has a trailing whitespace) as "once," "upon," "a," " " — with a lone-space token at the end. Depending on how a model is prompted — with "once upon a" or "once upon a " — the results may be completely different, because the model doesn't understand (as a person would) that the meaning is the same.
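The trailing-whitespace effect can be demonstrated with a toy greedy tokenizer. The vocabulary below is hypothetical and hand-picked for the example; real tokenizers (e.g. BPE) learn their vocabularies from data, but the longest-match idea is similar.

```python
# Hypothetical vocabulary: note that spaces are part of the tokens,
# as in many real subword vocabularies.
VOCAB = {"once", "upon", "a", " ", " once", " upon", " a"}

def toy_tokenize(text: str) -> list[str]:
    """Greedy longest-match tokenization against a fixed toy vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character: fall back to itself
            i += 1
    return tokens

print(toy_tokenize("once upon a"))   # ['once', ' upon', ' a']
print(toy_tokenize("once upon a "))  # ['once', ' upon', ' a', ' ']
```

One trailing space yields a different token sequence, so from the model's point of view the two prompts are different inputs.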

Tokenizers treat case differently, too. "Hello" isn't necessarily the same as "HELLO" to a model; "hello" is usually one token (depending on the tokenizer), while "HELLO" can be as many as three (e.g., "HE," "LL" and "O"). That's why many transformers fail the capital letter test.
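The same greedy-matching idea shows the case split. The vocabulary here is again hypothetical: it assumes the tokenizer learned "hello" whole from plentiful lowercase text, but only short fragments for the rarer all-caps form.

```python
# Hypothetical vocabulary: lowercase word learned whole, uppercase only in pieces.
CASE_VOCAB = {"hello", "HE", "LL", "O"}

def greedy_tokenize(text: str) -> list[str]:
    """Greedy longest-match tokenization against the toy vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in CASE_VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # fallback for unknown characters
            i += 1
    return tokens

print(greedy_tokenize("hello"))  # ['hello'] -> one token
print(greedy_tokenize("HELLO"))  # ['HE', 'LL', 'O'] -> three tokens
```

Same word, same meaning to a human, but three times the tokens and a completely different input sequence for the model.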

"It's kind of hard to get around the question of what exactly a 'word' should be for a language model, and even if we got human experts to agree on a perfect token vocabulary, models would probably still find it useful to 'chunk' things even further," Sheridan Feucht, a PhD student studying large language model interpretability at Northeastern University, told TechCrunch. "My guess would be that there's no such thing as a perfect tokenizer due to this kind of fuzziness."

This "fuzziness" creates even more problems in languages other than English.

Many tokenization methods assume that a space in a sentence denotes a new word. That's because they were designed with English in mind. But not all languages use spaces to separate words. Chinese and Japanese don't — nor do Korean, Thai or Khmer.

A 2023 Oxford study found that, because of differences in the way non-English languages are tokenized, it can take a transformer twice as long to complete a task phrased in a non-English language versus the same task phrased in English. The same study — and another — found that users of less "token-efficient" languages are likely to see worse model performance yet pay more for usage, given that many AI vendors charge per token.
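The pricing consequence is simple arithmetic: if a vendor bills per token and one language needs twice the tokens for the same content, its speakers pay twice as much. The rate and token counts below are made-up illustrative numbers, not any vendor's actual figures.

```python
PRICE_PER_1K_TOKENS = 0.01  # hypothetical rate, in dollars

def cost(num_tokens: int) -> float:
    """Dollar cost of a request billed per token."""
    return num_tokens / 1000 * PRICE_PER_1K_TOKENS

english_tokens = 1_000  # hypothetical: a prompt in English
other_tokens = 2_000    # hypothetical: the same prompt, tokenized less efficiently

print(f"English: ${cost(english_tokens):.3f}")
print(f"Other language: ${cost(other_tokens):.3f} ({other_tokens / english_tokens:.0f}x)")
```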

Tokenizers often treat each character in logographic systems of writing — systems in which printed symbols represent words without relating to pronunciation, like Chinese — as a distinct token, leading to high token counts. Similarly, tokenizers processing agglutinative languages — languages where words are made up of small meaningful word elements called morphemes, such as Turkish — tend to turn each morpheme into a token, increasing overall token counts. (The equivalent word for "hello" in Thai, สวัสดี, is six tokens.)
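A rough way to see why per-character (or per-byte) fallbacks inflate counts for some scripts: Python can count characters and UTF-8 bytes directly. This sketch uses Python's notion of characters and bytes as stand-ins for tokens, not any real tokenizer's output.

```python
def char_token_count(text: str) -> int:
    """Assume one token per character (a common fallback)."""
    return len(text)

def byte_token_count(text: str) -> int:
    """Assume one token per UTF-8 byte (an even finer fallback)."""
    return len(text.encode("utf-8"))

for word in ["hello", "สวัสดี"]:  # "hello" in English and in Thai
    print(word, char_token_count(word), byte_token_count(word))
```

The Thai greeting is six characters but eighteen UTF-8 bytes, since each Thai character encodes to three bytes; a fallback-heavy tokenizer pays that inflation on every word.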

In 2023, Google DeepMind AI researcher Yennie Jun conducted an analysis comparing the tokenization of different languages and its downstream effects. Using a dataset of parallel texts translated into 52 languages, Jun showed that some languages needed up to 10 times more tokens to capture the same meaning as in English.

Beyond language inequities, tokenization might explain why today's models are bad at math.

Digits are rarely tokenized consistently. Because tokenizers don't really know what numbers are, they might treat "380" as one token but represent "381" as a pair ("38" and "1"), effectively destroying the relationships between digits and the results of equations and formulas. The outcome is transformer confusion: a recent paper showed that models struggle to understand repetitive numerical patterns and context, particularly temporal data. (See: GPT-4 thinks 7,735 is greater than 7,926.)
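The inconsistency is easy to reproduce with a hypothetical vocabulary in which "380" happens to be a single entry but "381" is not. The resulting pieces carry no positional (ones/tens/hundreds) meaning.

```python
# Hypothetical number vocabulary: "380" was frequent enough to merge,
# its neighbor "381" was not.
NUM_VOCAB = {"380", "38", "1", "3", "8", "0"}

def tokenize_number(s: str) -> list[str]:
    """Greedy longest-match split of a digit string."""
    tokens, i = [], 0
    while i < len(s):
        for j in range(len(s), i, -1):
            if s[i:j] in NUM_VOCAB:
                tokens.append(s[i:j])
                i = j
                break
    return tokens

print(tokenize_number("380"))  # ['380'], one token
print(tokenize_number("381"))  # ['38', '1'], two tokens for a neighboring number
```

Two consecutive integers end up with completely different representations, which is exactly the kind of structure arithmetic depends on.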

That's also the reason models aren't great at solving anagram problems or reversing words.
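Word reversal shows the mismatch directly: a model that "sees" subword tokens rather than letters can reverse its token sequence, but that is not the same as reversing the characters. The split of "lollipop" below is hypothetical, chosen to make the mismatch visible.

```python
# Hypothetical subword split of "lollipop".
tokens = ["l", "oll", "ipop"]
word = "".join(tokens)

token_level_reverse = "".join(reversed(tokens))  # what a token-level view suggests
char_level_reverse = word[::-1]                  # the correct answer

print(token_level_reverse)  # 'ipopolll'
print(char_level_reverse)   # 'popillol'
```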

https://twitter.com/karpathy/status/1759996551378940395

So, tokenization clearly presents challenges for generative AI. Can those challenges be solved?

Maybe.

Feucht points to "byte-level" state space models like MambaByte, which can ingest far more data than transformers without a performance penalty by doing away with tokenization entirely. MambaByte, which works directly with raw bytes representing text and other data, is competitive with some transformer models on language-analyzing tasks while better handling "noise" like words with swapped characters, spacing and capitalized characters.
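"Byte-level" here means the input representation, not anything exotic about the model: instead of a learned token vocabulary, a model like MambaByte consumes the raw UTF-8 bytes of the text. This sketch shows only that input representation, not the model itself.

```python
# The raw-byte view of a string: one integer (0-255) per UTF-8 byte.
text = "Hello!"
byte_sequence = list(text.encode("utf-8"))

print(byte_sequence)  # [72, 101, 108, 108, 111, 33]
print(len(byte_sequence), "input elements for", len(text), "characters")
```

Because every possible input is just a sequence of 256 byte values, there is no vocabulary to learn and no tokenizer to derail, though sequences get much longer.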

Models like MambaByte are in the early research stages, however.

"It's probably best to let models look at characters directly without imposing tokenization, but right now that’s just computationally infeasible for transformers," Feucht said. "For transformer models in particular, computation scales quadratically with sequence length, and so we really want to use short text representations."

Barring a tokenization breakthrough, it seems new model architectures will be the key.