TF-IDF Cosine Similarity Calculator — Score Two Texts From 0 to 1
Paste two texts and get the cosine similarity of their smoothed TF-IDF (N=2) vectors on a 0–1 scale. The total vocabulary, count of shared terms, and each vector norm sit on one panel, with toggles for case-insensitive folding and sublinear TF. Up to 20,000 characters per document.
💡 About this tool
When you are implementing search ranking or checking two articles for overlap, you often want to check by hand how closely the wording of two passages actually overlaps. Before reaching for an embedding model, it helps to look at the plain lexical similarity first, and that is what this calculator gives you.
TF-IDF weights each term by multiplying its term frequency (how often it appears in the document) by its inverse document frequency (how rare it is across documents). Frequent, distinctive words count for more, while words that appear in both documents are discounted. This tool treats your two inputs as two documents, builds the shared vocabulary, forms the TF-IDF vectors, and returns the cosine similarity to four decimal places.
With only two documents, the raw IDF of a shared word collapses to zero, so this tool uses the same smoothed formula as scikit-learn: IDF = ln((N+1)/(df+1)) + 1. That assigns a weight of 1.0 to shared terms and about 1.405 to terms in only one document. The total vocabulary, shared-term count, and norms are shown alongside so you can trace how the number was built.
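The steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual source: the function names `smoothed_idf` and `tfidf_cosine` are hypothetical, and the whitespace tokenizer with lowercasing is a simplifying assumption. Term frequency is taken as the raw count.

```python
import math
from collections import Counter


def smoothed_idf(n_docs: int, df: int) -> float:
    """Smoothed IDF as given above: ln((N + 1) / (df + 1)) + 1."""
    return math.log((n_docs + 1) / (df + 1)) + 1.0


def tfidf_cosine(text_a: str, text_b: str) -> dict:
    # Assumption: lowercase + whitespace split, for illustration only.
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())

    vocab = sorted(set(counts_a) | set(counts_b))
    shared = set(counts_a) & set(counts_b)

    # With N = 2, df is 2 for shared terms and 1 for all others.
    idf = {t: smoothed_idf(2, 2 if t in shared else 1) for t in vocab}

    # Raw count times IDF, one component per vocabulary term.
    vec_a = [counts_a[t] * idf[t] for t in vocab]
    vec_b = [counts_b[t] * idf[t] for t in vocab]

    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(x * x for x in vec_a))
    norm_b = math.sqrt(sum(y * y for y in vec_b))
    # An empty document has a zero norm; report 0.0 instead of dividing by zero.
    cosine = dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    return {
        "cosine": cosine,
        "vocab_size": len(vocab),
        "shared_terms": len(shared),
        "norm_a": norm_a,
        "norm_b": norm_b,
    }


result = tfidf_cosine("the cat sat", "the cat ran")
print(round(result["cosine"], 4), result["vocab_size"], result["shared_terms"])
```

For these two three-word texts, the vocabulary has four terms, two of them shared, and the cosine rounds to 0.5031.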
🧐 Frequently asked questions
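How should I read the cosine value? A value of 1 means the two vectors point in exactly the same direction (angle 0°), so the closer the score is to 1, the more similar the texts. A value of 0 means the texts share no terms and are lexically unrelated (angle 90°). Because TF-IDF weights are never negative, the score cannot fall below 0. Around 0.5 indicates partial vocabulary overlap.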
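If you prefer to think in angles, a cosine score can be converted back with the arccosine. The helper below is a small sketch; the name `cosine_to_degrees` is hypothetical, and the clamp is there only to guard against floating-point values landing slightly outside the 0 to 1 range.

```python
import math


def cosine_to_degrees(similarity: float) -> float:
    """Convert a cosine similarity in [0, 1] to the angle between the vectors."""
    clamped = min(1.0, max(0.0, similarity))  # guard against rounding drift
    return math.degrees(math.acos(clamped))


for score in (1.0, 0.5, 0.0):
    print(score, round(cosine_to_degrees(score), 1))
```

A score of 0.5 corresponds to an angle of 60°, halfway in cosine terms but two thirds of the way from 0° to 90°.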
Why is there a +1 smoothing in the IDF? With only two documents, a word appearing in both would get a raw IDF of zero and lose its weight entirely. The scikit-learn default smoothing (+1 to numerator and denominator, +1 at the end) keeps a weight of 1.0 on shared terms so the computation stays stable.
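To see the effect of the smoothing, the sketch below compares an unsmoothed IDF, assumed here to be ln(N/df), with the smoothed formula for the two cases that can occur when N = 2. The function names are illustrative.

```python
import math


def raw_idf(n_docs: int, df: int) -> float:
    """Unsmoothed IDF, assumed here to be ln(N / df)."""
    return math.log(n_docs / df)


def smoothed_idf(n_docs: int, df: int) -> float:
    """Smoothed IDF: ln((N + 1) / (df + 1)) + 1."""
    return math.log((n_docs + 1) / (df + 1)) + 1.0


# df = 1: term in one document only; df = 2: term shared by both.
for df in (1, 2):
    print(df, round(raw_idf(2, df), 4), round(smoothed_idf(2, df), 4))
# prints:
# 1 0.6931 1.4055
# 2 0.0 1.0
```

With the raw formula, every shared term would be multiplied by zero, so the dot product of any two texts would be zero as well. The smoothed version keeps shared terms at weight 1.0.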
What changes when I turn on sublinear TF? Instead of the raw count, the term frequency becomes 1 + ln(tf), which dampens the effect of a word repeated many times. It matters for long texts where one term recurs heavily. It mirrors scikit-learn's sublinear_tf option, which is off by default there.
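The dampening is easy to see with a few counts. In this sketch the function name `sublinear_tf` is illustrative, and leaving absent terms at 0 is an assumption about how the tool handles a count of zero.

```python
import math


def sublinear_tf(count: int) -> float:
    """Return 1 + ln(count) for count > 0; absent terms stay at 0."""
    return 1.0 + math.log(count) if count > 0 else 0.0


for count in (1, 2, 10, 100):
    print(count, round(sublinear_tf(count), 3))
# prints:
# 1 1.0
# 2 1.693
# 10 3.303
# 100 5.605
```

A term that appears 100 times ends up weighted under six times as much as a term that appears once, rather than 100 times as much.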
How are CJK and other scripts tokenized? CJK characters (Chinese, Japanese kana, Korean) are each treated as a single token. Runs of letters, digits, apostrophes, hyphens, and underscores form a single word. No morphological analysis is applied, so treat the result as a lexical, bag-of-words measure of closeness.
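A tokenizer along these lines could be approximated with one regular expression. This is a sketch under stated assumptions, not the tool's actual rule set: the Unicode ranges chosen for "CJK" (hiragana and katakana, the main Han block, Hangul syllables) and the names `CJK`, `TOKEN_RE`, and `tokenize` are all illustrative.

```python
import re

# Assumed ranges: kana (U+3040-30FF), Han (U+4E00-9FFF), Hangul (U+AC00-D7AF).
CJK = "\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af"

# Either one CJK character, or a run of word characters, apostrophes,
# and hyphens that contains no CJK character.
TOKEN_RE = re.compile(rf"[{CJK}]|(?:(?![{CJK}])[\w'\-])+")


def tokenize(text: str) -> list[str]:
    return TOKEN_RE.findall(text)


print(tokenize("state-of-the-art don't 東京 tf_idf"))
# prints: ['state-of-the-art', "don't", '東', '京', 'tf_idf']
```

Because each CJK character becomes its own token, a two-character word such as 東京 contributes two vocabulary entries rather than one.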
What does case normalization affect? When on, ASCII letters are folded to lowercase, so "Cosine" and "cosine" count as the same term. CJK is unaffected. Turn it off when you need to distinguish capitalized proper nouns.
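Folding only ASCII letters differs from a general Unicode lowercase. A minimal sketch of ASCII-only folding follows; the helper name `fold_ascii` is hypothetical.

```python
import string

# Translation table that maps A-Z to a-z and leaves every other character alone.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def fold_ascii(text: str) -> str:
    """Lowercase ASCII letters only; non-ASCII characters pass through unchanged."""
    return text.translate(_ASCII_LOWER)


print(fold_ascii("Cosine"))        # prints: cosine
print(fold_ascii("東京 TF-IDF"))    # prints: 東京 tf-idf
```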
📚 Notes on TF-IDF and cosine
Since it was proposed in the 1970s, TF-IDF has remained a foundation of full-text search scoring. The default score in Elasticsearch and Lucene is BM25, an evolution of TF-IDF, but the underlying idea — frequency times rarity — is shared. Inspecting a plain TF-IDF cosine by hand builds intuition for why a search ranking comes out in a given order.
Cosine similarity ignores vector length and looks only at direction, so two texts can be compared by the makeup of their vocabulary without being thrown off by how long each one is. A long and a short document can still score high if their word composition is similar. This tool also shows norm A and norm B, so you can see the magnitude of each document's TF-IDF vector (the accumulation of its terms and weights) at the same time.
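The length-invariance can be checked directly: scaling one vector changes its norm but not the cosine. The vectors below are made-up numbers chosen for illustration, not output from the tool.

```python
import math


def norm(u: list[float]) -> float:
    return math.sqrt(sum(x * x for x in u))


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (norm(u) * norm(v))


a = [1.0, 2.0, 0.0]
b = [2.0, 1.0, 1.0]
long_a = [3 * x for x in a]  # same direction as a, three times the length

print(round(cosine(a, b), 4), round(cosine(long_a, b), 4))  # same value twice
print(round(norm(a), 3), round(norm(long_a), 3))            # norms differ by 3x
```

With raw term counts, pasting the same text three times scales its vector in exactly this way. With sublinear TF turned on, the scaling is no longer exact, so the score can shift slightly.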