🤗 TokenizerBench

Evaluate and compare tokenizers on multilingual text, code, scientific formulas, and edge cases. Load from the Hugging Face Hub, upload your own files, or use a tiktoken encoding.

Type or paste any text and see instant tokenization results.

Load tokenizer

Source

Examples: xlm-roberta-base · google/mt5-base · facebook/mbart-large-50 · ai4bharat/indic-bert
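The three load paths (Hub repo, uploaded files, tiktoken encoding) can be sketched behind one interface. This is an illustrative helper, not the app's actual code; the function name `load_tokenizer` is hypothetical, and it assumes the `transformers` and `tiktoken` packages are installed (each is imported only when its source kind is requested).

```python
# Hypothetical helper: resolve a tokenizer source into an encode callable.
# Assumes `transformers` / `tiktoken` are available for the matching kind.

def load_tokenizer(source: str, kind: str = "hub"):
    """Return a callable mapping text -> list of token ids."""
    if kind == "hub":
        # e.g. source = "xlm-roberta-base"
        from transformers import AutoTokenizer
        tok = AutoTokenizer.from_pretrained(source)
        return lambda text: tok.encode(text, add_special_tokens=False)
    if kind == "tiktoken":
        # e.g. source = "cl100k_base"
        import tiktoken
        return tiktoken.get_encoding(source).encode
    raise ValueError(f"unknown tokenizer kind: {kind!r}")
```

`add_special_tokens=False` keeps the token count comparable across tokenizers that add different numbers of special tokens (BOS/EOS, CLS/SEP).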

Results will appear here.


Browse dataset samples

Click any sample below to load it into the text box above.


Metrics explained — Fertility = tokens per word (lower is better; ≥ 4 is poor) · Compression = tokens per char (lower is better) · Fidelity = encode→decode must reproduce the original text exactly
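The metrics above reduce to a few lines of arithmetic. A minimal sketch follows; `tokenizer_metrics` and the toy character-level tokenizer are illustrative only, not the app's implementation.

```python
# Sketch of the three metrics: fertility, compression, fidelity.

def tokenizer_metrics(text, encode, decode):
    tokens = encode(text)
    words = text.split()
    return {
        "fertility": len(tokens) / max(len(words), 1),   # tokens per word
        "compression": len(tokens) / max(len(text), 1),  # tokens per char
        "fidelity": decode(tokens) == text,              # lossless round trip
    }

# Toy tokenizer: one token per character (so fertility = chars per word).
encode = lambda s: [ord(c) for c in s]
decode = lambda ids: "".join(chr(i) for i in ids)

print(tokenizer_metrics("hello world", encode, decode))
# fertility 5.5 (11 tokens / 2 words), compression 1.0, fidelity True
```

A character-level tokenizer always has fidelity True and compression 1.0; real subword tokenizers trade compression well below 1.0 against the risk of fidelity failures on unusual whitespace or unnormalized Unicode.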

Upload guide

| File(s) to upload | Tokenizer type |
|---|---|
| `tokenizer.json` | Any Hugging Face fast tokenizer (BERT, RoBERTa, GPT-2, LLaMA…) |
| `tokenizer.json` + `tokenizer_config.json` + `vocab.txt` | Full Hugging Face tokenizer folder |
| `vocab.json` + `merges.txt` | BPE tokenizer (GPT-2 style) |
| `*.model` | SentencePiece (T5, mT5, XLM-R, ALBERT…) |
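The upload table maps file names to loaders, which can be expressed as a small dispatcher. This is a hypothetical sketch of that mapping; the loader classes named in the comments (`PreTrainedTokenizerFast`, `GPT2TokenizerFast`, `SentencePieceProcessor`) are the usual choices for each format, not necessarily what this app uses internally.

```python
from pathlib import Path

# Hypothetical dispatcher mirroring the upload table: decide which loader
# would handle a set of uploaded tokenizer files.

def detect_tokenizer_type(filenames):
    names = {Path(f).name for f in filenames}
    if "tokenizer.json" in names:
        # transformers.PreTrainedTokenizerFast(tokenizer_file=...)
        # also covers the full-folder case with tokenizer_config.json + vocab.txt
        return "hf_fast"
    if {"vocab.json", "merges.txt"} <= names:
        # transformers.GPT2TokenizerFast(vocab_file=..., merges_file=...)
        return "bpe"
    if any(n.endswith(".model") for n in names):
        # sentencepiece.SentencePieceProcessor(model_file=...)
        return "sentencepiece"
    raise ValueError(f"unrecognized tokenizer files: {sorted(names)}")
```

The checks are ordered from most to least specific: a full tokenizer folder also contains `vocab.txt`, but the presence of `tokenizer.json` alone is enough to pick the fast-tokenizer path.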