BLEU Score (Bigram) Calculator | Score a translation against a reference
Paste a reference and a candidate to get a BLEU-2 score built from 1-gram and 2-gram precision, plus a brevity penalty that punishes translations shorter than the reference. Add-1 smoothing keeps short sentences off zero.
💡 About this tool
When you are ranking a handful of machine-translation or LLM outputs and just want to know which draft is closest to a reference, spinning up a full BLEU-4 pipeline is overkill. For a quick eyeball comparison, bigram precision usually tells you what you need.
This calculator is the lightweight, single-reference, bigram-capped version. As you type, it returns p1 (the fraction of candidate words that also appear in the reference, with repeats clipped to the reference count), p2 (the same clipped fraction for adjacent word pairs), the brevity penalty derived from the length ratio, and the combined BLEU-2, all with their numerator/denominator shown. The tokenizer lowercases input, splits Latin script on word boundaries, and treats CJK characters as single tokens, so mixed-language text does not break it.
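As a rough illustration of the tokenizing and clipping described above, here is a minimal Python sketch. The names `tokenize`, `ngrams`, and `clipped_precision` are hypothetical, and the regular expression (including the Unicode ranges chosen for CJK) is an assumption; the tool's actual tokenizer may differ.

```python
import re
from collections import Counter

# Assumed pattern: either one CJK character (Hiragana/Katakana, CJK Unified
# Ideographs, Hangul syllables) or a run of non-CJK word characters.
_CJK = r"\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af"
_TOKEN = re.compile(rf"[{_CJK}]|[^\W{_CJK}]+")

def tokenize(text):
    """Lowercase the text and split it into word or single-CJK-character tokens."""
    return _TOKEN.findall(text.lower())

def ngrams(tokens, n):
    """Return every run of n adjacent tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    """Return (matched, total) n-gram counts for the candidate.

    Each candidate n-gram is credited at most as many times as it
    occurs in the reference, which is the "clipping".
    """
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched, sum(cand.values())

reference = tokenize("The cat is on the mat")
candidate = tokenize("the the the cat")
print(tokenize("The cat 猫です"))                    # ['the', 'cat', '猫', 'で', 'す']
print(clipped_precision(candidate, reference, 1))   # (3, 4): "the" is clipped to 2
print(clipped_precision(candidate, reference, 2))   # (1, 3): only "the cat" matches
```

Returning the numerator and denominator separately, rather than a single float, mirrors the way the breakdown shows both.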
Because a zero in p1 or p2 collapses the whole geometric mean to zero, the calculator applies add-1 (Laplace) smoothing: any precision that would be zero gets 1 added to both its numerator and denominator. That keeps a single-sentence comparison from snapping straight to 0.
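The smoothing and the geometric mean can be sketched in a few lines of Python. This is an illustration under assumptions, not the tool's source: `smoothed` and `bleu2` are hypothetical names, the equal weighting of p1 and p2 is the conventional choice, and the handling of an empty denominator (a one-token candidate has no bigrams) is a guess.

```python
import math

def smoothed(matched, total):
    """Return a precision, applying add-1 smoothing only when it would be zero."""
    if matched == 0:
        # Assumption: an empty denominator (total == 0) is smoothed the same
        # way, which yields 1/1.
        return 1 / (total + 1)
    return matched / total

def bleu2(p1, p2, bp=1.0):
    """Brevity penalty times the equal-weight geometric mean of p1 and p2."""
    return bp * math.sqrt(p1 * p2)

# Illustrative counts: 3 of 4 unigrams matched, 0 of 3 bigrams matched.
p1 = smoothed(3, 4)   # 0.75, left untouched
p2 = smoothed(0, 3)   # 0.25 instead of 0.0
print(round(bleu2(p1, p2), 4))  # 0.433
```

Without smoothing, the same counts would give a p2 of 0 and therefore a BLEU-2 of 0, no matter how good p1 was.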
🧐 Frequently Asked Questions
How is BLEU-2 different from the BLEU-4 in papers? Research-grade BLEU-4 takes the geometric mean of 1-gram through 4-gram precision and was designed to work with multiple references. This tool caps at 2-gram and uses one reference, so the numbers will not match. It is built for fast draft-to-draft comparison, not for reporting in a paper.
Why is my score lower than I expected? If the candidate is shorter than the reference, the brevity penalty drops below 1 and pulls everything down. And if adjacent word pairs do not line up, p2 falls, so getting the words right but the order wrong still hurts the score.
What counts as a "good" BLEU score? BLEU runs from 0 to 1, and a candidate identical to the reference scores 1. There is no universal pass mark; the metric is meant for ranking several candidates against the same reference, not for judging one output in isolation.
Does it handle non-English text? Yes. CJK characters are scored one character per token, so Japanese, Chinese, or Korean text works without word segmentation. Just note that character-level scores are not directly comparable with word-level BLEU.
What do the r and c token counts mean? r is the reference token count and c is the candidate token count. When c is smaller than r the brevity penalty kicks in; when c ≥ r the penalty is 1. Comparing the two in the breakdown tells you whether a low score comes from length or from precision.
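For readers who want to check the r and c behaviour by hand, here is a small Python sketch of the brevity penalty. It assumes the tool uses the standard BLEU form, exp(1 - r/c) when c < r; the page itself only says the penalty is derived from the length ratio, and `brevity_penalty` is a hypothetical name.

```python
import math

def brevity_penalty(r, c):
    """Standard BLEU brevity penalty for reference length r and candidate length c."""
    if c == 0:
        return 0.0          # assumption: an empty candidate scores nothing
    if c >= r:
        return 1.0          # candidates at least as long as the reference are not penalized
    return math.exp(1 - r / c)

print(brevity_penalty(6, 6))            # 1.0
print(round(brevity_penalty(6, 4), 4))  # 0.6065
```

With r = 6 and c = 4, even a candidate whose p1 and p2 are both perfect would top out at roughly 0.61 under this formula.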
📚 Why BLEU rewards word order
BLEU was introduced in 2002 as one of the first metrics to automate machine-translation scoring, replacing slow human judgment with n-gram overlap and dramatically speeding up the system-improvement loop. The reason higher-order n-grams matter is fluency: unigram precision (p1) checks whether you picked the right words, while bigram precision (p2) checks whether you put them in a plausible order. A candidate can score well on p1 yet tank on p2 if the phrasing is scrambled, which is exactly why glancing at p1 and p2 separately, as this tool lays them out, is often more diagnostic than the single blended number.
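A scrambled sentence makes the gap between p1 and p2 concrete. The Python sketch below is illustrative only: `ngram_counts` and `precision` are hypothetical helpers, tokens are split on plain whitespace for simplicity, and no smoothing is applied so the raw counts stay visible.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n adjacent tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

def precision(candidate, reference, n):
    """Return (matched, total) clipped n-gram counts for the candidate."""
    cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
    # Counter intersection keeps the minimum count per n-gram, i.e. clipping.
    return sum((cand & ref).values()), sum(cand.values())

reference = "the cat sat on the mat".split()
scrambled = "mat the on sat cat the".split()

print(precision(scrambled, reference, 1))  # (6, 6): every word is right
print(precision(scrambled, reference, 2))  # (0, 5): no adjacent pair survives
```

The candidate uses exactly the reference's words, so p1 is perfect, but reversing the order destroys every bigram.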