What Is a Token?
A token is the basic unit of text that AI language models process. Text is split into tokens—which might be words, parts of words, or characters—before being fed to the model. Token counts determine context limits, costs, and processing time.
What Is a Token?
Tokens are how language models see text. Before processing, text is split into tokens using a tokenizer—a component that converts raw text into a sequence of token IDs that the model can process. Tokens might be whole words ('hello' = 1 token), word pieces ('uncomfortable' = 'un' + 'comfort' + 'able' = 3 tokens), or individual characters. Different models use different tokenizers with different vocabularies. In English, a rough approximation is 1 token ≈ 4 characters or ¾ of a word, but this varies significantly by language and content type. Code, technical terms, and non-English languages often require more tokens per word.
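As a quick sanity check, the rule of thumb above (roughly 4 characters per token in English) can be turned into an estimator. This is an approximation only; exact counts require the specific model's own tokenizer:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters per token rule of thumb.

    Approximation for English prose only; code, technical terms, and
    non-English text usually need more tokens than this suggests.
    """
    return max(1, round(len(text) / 4))

print(estimate_tokens("Hello, how are you?"))  # 19 characters -> ~5 tokens
```

For precise counts, use the tokenizer that ships with the model you are calling rather than any character-based heuristic.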
How Tokens Work
Tokenizers use algorithms (like BPE—Byte Pair Encoding) to build vocabularies from training data. Common sequences become single tokens; rare sequences are split into smaller pieces. This balances vocabulary size (smaller is more efficient) with sequence length (longer sequences are more expensive). At inference time, the tokenizer converts input text to token IDs, the model processes these IDs, and then token IDs are converted back to text. Each token is represented as a vector (embedding) that the model processes. The model has no concept of 'characters' or 'words'—only tokens. This is why models sometimes struggle with character-level tasks like counting letters.
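To make the merge process concrete, here is a minimal BPE sketch on a toy corpus. It is illustrative only: production tokenizers operate on bytes and learn tens of thousands of merges from large corpora, but the core loop — count adjacent pairs, merge the most frequent one — is the same:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words (word -> frequency dict)."""
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for word, freq in words.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])  # fuse the pair into one symbol
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: each word starts as a sequence of characters, with a frequency.
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}
for step in range(3):
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print("merged:", pair)
```

After a few merges the frequent word "low" collapses into a single token, while rarer suffixes like "er" and "est" remain split — exactly the common-sequences-become-single-tokens behavior described above.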
Why Tokens Matter
Tokens determine fundamental constraints of AI systems. Context windows are measured in tokens—a 128K context window can process about 96,000 words. API pricing is per token—understanding tokenization helps estimate costs. Response length limits are in tokens. Generation speed depends on token count. For developers, understanding tokenization explains behaviors like why models struggle with certain tasks (character manipulation), how to optimize prompts (fewer tokens = faster/cheaper), and why different languages have different effective context limits.
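Because pricing is per token, a small helper makes cost budgeting concrete. The per-million-token prices below are hypothetical placeholders, not any provider's actual rates:

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  in_price_per_m: float, out_price_per_m: float) -> float:
    """Estimate API cost in dollars from token counts and per-million-token prices."""
    return input_tokens / 1e6 * in_price_per_m + output_tokens / 1e6 * out_price_per_m

# Hypothetical pricing: $3 per million input tokens, $15 per million output tokens.
cost = estimate_cost(input_tokens=50_000, output_tokens=2_000,
                     in_price_per_m=3.0, out_price_per_m=15.0)
print(f"${cost:.3f}")  # $0.180
```

Note the asymmetry: output tokens are typically priced several times higher than input tokens, so trimming verbose responses often saves more than trimming prompts.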
Examples of Tokens
The sentence 'Hello, how are you?' might tokenize as ['Hello', ',', ' how', ' are', ' you', '?']—6 tokens. A code snippet might tokenize differently: 'function()' could be ['function', '(', ')']—3 tokens. GPT-4's tokenizer has a vocabulary of about 100K tokens. Claude's tokenizer handles multiple languages, but English is the most efficient. Long technical terms or unusual words split into multiple tokens, consuming more of your context budget.
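The 6-token split above can be approximated with a naive rule-based splitter. This is an illustration only: real BPE tokenizers use learned subword vocabularies and typically attach the leading space to the following token (' how'), which this sketch does not:

```python
import re

def naive_tokenize(text: str) -> list[str]:
    """Naive splitter for illustration: words and punctuation as separate pieces.

    Real tokenizers do not work this way — they apply learned subword merges —
    but for simple English sentences the piece count is often similar.
    """
    return re.findall(r"\w+|[^\w\s]", text)

print(naive_tokenize("Hello, how are you?"))
# ['Hello', ',', 'how', 'are', 'you', '?'] -> 6 pieces, matching the count above
```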
Common Misconceptions
Tokens aren't words—a word might be 1-5+ tokens depending on the word and tokenizer. Another misconception is that all tokenizers are the same; different models use different tokenization, so token counts vary. Tokens aren't characters either—they're an intermediate representation. The same text can have very different token counts in different languages or domains.
Key Takeaways
- Tokens are the basic units language models process; a single word may map to one or several tokens depending on the word and the tokenizer.
- Token counts drive context limits, API costs, and generation speed, so understanding tokenization helps developers estimate costs and optimize prompts.
- The ~4 characters per token rule is a rough English-only approximation; code, technical terms, and non-English languages often consume more tokens.
Written by the Promitheus Team
Part of the AI Glossary · 50 terms
Build AI with Tokens
Promitheus provides the infrastructure to manage tokens, context windows, and related capabilities in your AI applications.