🤗 TokenizerBench
Evaluate and compare tokenizers on multilingual text, code, scientific formulas, and edge cases. Load from the Hugging Face Hub, upload your own files, or use a tiktoken encoding.
Type or paste any text and see instant tokenization results.
Load tokenizer
Examples: xlm-roberta-base · google/mt5-base · facebook/mbart-large-50 · ai4bharat/indic-bert
Results will appear here.
Browse dataset samples
Click any sample below to load it into the text box above.
Metrics explained — Fertility = tokens per word (lower is better; ≥ 4 is poor) · Compression = tokens per character · Fidelity = the encode→decode round trip must reproduce the original text exactly
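The three metrics can be computed in a few lines. A minimal sketch, assuming whitespace word splitting and raw character counts; the app's exact counting rules may differ.

```python
# Sketch of the three metrics defined above (naive counting assumptions).
def metrics(text, tokens, decoded):
    words = text.split()                           # naive whitespace word count
    fertility = len(tokens) / max(len(words), 1)   # tokens per word
    compression = len(tokens) / max(len(text), 1)  # tokens per character
    fidelity = decoded == text                     # round-trip exactness
    return fertility, compression, fidelity

# Hypothetical 4-token split of a 3-word, 11-character sentence:
f, c, ok = metrics("the cat sat", ["the", "Ġcat", "Ġs", "at"], "the cat sat")
print(f, c, ok)  # fertility = 4/3 tokens per word, round trip OK
```

A fertility near 1 means the tokenizer treats most words as single tokens; values of 4 or more mean words are being shattered into many fragments, which inflates sequence length.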
Upload guide
| File(s) to upload | Tokenizer type |
|---|---|
| `tokenizer.json` | Any Hugging Face fast tokenizer (BERT, RoBERTa, GPT-2, LLaMA…) |
| `tokenizer.json` + `tokenizer_config.json` + `vocab.txt` | Full HF tokenizer folder |
| `vocab.json` + `merges.txt` | BPE tokenizer (GPT-2 style) |
| `*.model` | SentencePiece (T5, mT5, XLM-R, mBART…) |
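The `tokenizer.json` row above corresponds to the `tokenizers` library's file loader. A minimal sketch, assuming the app loads uploads with similar calls; it builds a tiny WordLevel tokenizer in memory so the round trip is self-contained.

```python
# Sketch: save and reload a tokenizer.json the way an uploaded file is loaded.
# The tiny WordLevel vocab here is purely illustrative.
from tokenizers import Tokenizer, models, pre_tokenizers

vocab = {"[UNK]": 0, "hello": 1, "world": 2}
tok = Tokenizer(models.WordLevel(vocab, unk_token="[UNK]"))
tok.pre_tokenizer = pre_tokenizers.Whitespace()
tok.save("tokenizer.json")                       # the single-file upload case

reloaded = Tokenizer.from_file("tokenizer.json") # fast-tokenizer load path
print(reloaded.encode("hello world").tokens)     # ['hello', 'world']
```

Words outside the vocabulary come back as `[UNK]`, which is why a full upload (with `tokenizer_config.json` and `vocab.txt`) gives more faithful behavior for real models.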