🤗 TokenizerBench

Evaluate and compare tokenizers on multilingual text, code, scientific formulas, and edge cases. Load from the Hugging Face Hub, upload your own files, or use a tiktoken encoding.

Type or paste any text and see instant tokenization results.

Load tokenizer

Source

Examples: xlm-roberta-base · google/mt5-base · facebook/mbart-large-50 · ai4bharat/indic-bert
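The three load paths (Hub repo, uploaded files, tiktoken encoding) can be sketched behind one interface. This is an illustrative helper, not the app's actual code; the function name `load_tokenizer` is hypothetical, and it assumes the `transformers` and `tiktoken` packages are installed (each is imported only when its source kind is requested).

```python
# Hypothetical helper: resolve a tokenizer source into an encode callable.
# Assumes `transformers` / `tiktoken` are available for the matching kind.

def load_tokenizer(source: str, kind: str = "hub"):
    """Return a callable mapping text -> list of token ids."""
    if kind == "hub":
        # e.g. source = "xlm-roberta-base"
        from transformers import AutoTokenizer
        tok = AutoTokenizer.from_pretrained(source)
        return lambda text: tok.encode(text, add_special_tokens=False)
    if kind == "tiktoken":
        # e.g. source = "cl100k_base"
        import tiktoken
        return tiktoken.get_encoding(source).encode
    raise ValueError(f"unknown tokenizer kind: {kind!r}")
```

`add_special_tokens=False` keeps the token count comparable across tokenizers that add different numbers of special tokens (BOS/EOS, CLS/SEP).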

Results will appear here.


Browse dataset samples

Click any sample below to load it into the text box above.


Metrics explained — Fertility = tokens per word (lower is better; ≥ 4 is poor) · Compression = tokens per char (lower is better) · Fidelity = encode→decode must reproduce the original text exactly
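The metrics above reduce to a few lines of arithmetic. A minimal sketch follows; `tokenizer_metrics` and the toy character-level tokenizer are illustrative only, not the app's implementation.

```python
# Sketch of the three metrics: fertility, compression, fidelity.

def tokenizer_metrics(text, encode, decode):
    tokens = encode(text)
    words = text.split()
    return {
        "fertility": len(tokens) / max(len(words), 1),   # tokens per word
        "compression": len(tokens) / max(len(text), 1),  # tokens per char
        "fidelity": decode(tokens) == text,              # lossless round trip
    }

# Toy tokenizer: one token per character (so fertility = chars per word).
encode = lambda s: [ord(c) for c in s]
decode = lambda ids: "".join(chr(i) for i in ids)

print(tokenizer_metrics("hello world", encode, decode))
# fertility 5.5 (11 tokens / 2 words), compression 1.0, fidelity True
```

A character-level tokenizer always has fidelity True and compression 1.0; real subword tokenizers trade compression well below 1.0 against the risk of fidelity failures on unusual whitespace or unnormalized Unicode.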

Upload guide

| File(s) to upload | Tokenizer type |
|---|---|
| `tokenizer.json` | Any Hugging Face fast tokenizer (BERT, RoBERTa, GPT-2, LLaMA…) |
| `tokenizer.json` + `tokenizer_config.json` + `vocab.txt` | Full Hugging Face tokenizer folder |
| `vocab.json` + `merges.txt` | BPE tokenizer (GPT-2 style) |
| `*.model` | SentencePiece (T5, mT5, XLM-R, ALBERT…) |
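The upload table maps file names to loaders, which can be expressed as a small dispatcher. This is a hypothetical sketch of that mapping; the loader classes named in the comments (`PreTrainedTokenizerFast`, `GPT2TokenizerFast`, `SentencePieceProcessor`) are the usual choices for each format, not necessarily what this app uses internally.

```python
from pathlib import Path

# Hypothetical dispatcher mirroring the upload table: decide which loader
# would handle a set of uploaded tokenizer files.

def detect_tokenizer_type(filenames):
    names = {Path(f).name for f in filenames}
    if "tokenizer.json" in names:
        # transformers.PreTrainedTokenizerFast(tokenizer_file=...)
        # also covers the full-folder case with tokenizer_config.json + vocab.txt
        return "hf_fast"
    if {"vocab.json", "merges.txt"} <= names:
        # transformers.GPT2TokenizerFast(vocab_file=..., merges_file=...)
        return "bpe"
    if any(n.endswith(".model") for n in names):
        # sentencepiece.SentencePieceProcessor(model_file=...)
        return "sentencepiece"
    raise ValueError(f"unrecognized tokenizer files: {sorted(names)}")
```

The checks are ordered from most to least specific: a full tokenizer folder also contains `vocab.txt`, but the presence of `tokenizer.json` alone is enough to pick the fast-tokenizer path.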