Overview

Compute the Jaccard index and distance for two texts. Switch between word, 2-gram, and 3-gram tokenization, with optional case folding, for plagiarism and duplicate checks.

📘 How to Use

  1. Paste the two texts you want to compare into Text A and Text B
  2. Pick a tokenization mode: word, 2-gram, or 3-gram
  3. Toggle case sensitivity on or off
  4. Read the Jaccard index, distance, and the set-size breakdown

Text Jaccard Similarity Calculator


Word mode splits on whitespace and punctuation. The 2-gram and 3-gram modes slide a window across the characters and work well for short or CJK text.

Case-insensitive

Whether "A" and "a" count as the same token

Jaccard index (scale: 0 to 1.0)

Jaccard distance (scale: 0 to 1.0)

Set breakdown: |A|, |B|, |A ∩ B|, |A ∪ B|

※ Jaccard index J(A,B) = |A ∩ B| / |A ∪ B|, and distance d = 1 − J. We define J = 0 (so d = 1) when both sets are empty.
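The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual code: the function name `jaccard` and the dictionary keys are made up for the example, and the only convention taken from the page is J = 0 when both sets are empty.

```python
def jaccard(a: set, b: set) -> dict:
    """Return the index, distance, and set-size breakdown for two token sets."""
    inter = a & b
    union = a | b
    # Convention from the note above: J = 0 when both sets are empty.
    index = len(inter) / len(union) if union else 0.0
    return {
        "index": index,
        "distance": 1.0 - index,
        "|A|": len(a),
        "|B|": len(b),
        "|A ∩ B|": len(inter),
        "|A ∪ B|": len(union),
    }


result = jaccard({"the", "quick", "fox"}, {"the", "lazy", "fox"})
print(result["index"])     # 2 shared / 4 total = 0.5
print(result["distance"])  # 1 - 0.5 = 0.5
```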

※ Word mode extracts tokens as runs of Unicode letters, digits, and Han / Kana / Hangul characters.
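One way to approximate this kind of word extraction in Python is shown below. It is a sketch under an assumption: the pattern `[^\W_]+` stands in for the tool's letter / digit / Han / Kana / Hangul ranges, which the page does not spell out. The names `WORD_RE` and `word_tokens` are hypothetical.

```python
import re

# Assumption: [^\W_]+ approximates the tool's Unicode ranges. In Python 3,
# \w on str is Unicode-aware, so it covers letters and digits in any script;
# the underscore is excluded explicitly.
WORD_RE = re.compile(r"[^\W_]+")


def word_tokens(text: str, case_insensitive: bool = True) -> set:
    """Split text into a set of word tokens, optionally folding case."""
    if case_insensitive:
        text = text.casefold()
    return set(WORD_RE.findall(text))


print(sorted(word_tokens("Hello, world! Hello again.")))
# ['again', 'hello', 'world']
print(len(word_tokens("東京都に住んでいます")))
# 1: the unbroken run is a single token
```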

Article

Text Jaccard Similarity Calculator | Word and Character N-gram Overlap

Turn two texts into sets and compute the Jaccard index J = |A ∩ B| / |A ∪ B| and its distance 1 − J. Switch between word, 2-gram, and 3-gram tokenization and case handling on the spot.

💡 Measure overlap, ignore order

When you want to know how alike two passages are, edit-distance metrics that walk character by character tend to miss copy-paste that simply shuffled the word order. The Jaccard index treats each text as a set of elements and asks only one thing: how much of the combined vocabulary do they share? That set-based view is the workhorse behind plagiarism screens, duplicate-article sweeps, and near-duplicate filtering in search indexes. It is precise at catching exact and near-exact overlap, but by design it shrugs at paraphrase.
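A small experiment makes the order-blindness concrete. In this sketch, `difflib.SequenceMatcher` from the Python standard library stands in for an order-sensitive comparison; it is not an edit-distance metric in the strict sense, and the two sentences are invented for illustration.

```python
import difflib

a = "the cat chased the dog"
b = "the dog chased the cat"

# Set-based view: only the vocabulary matters, not where each word sits.
set_a, set_b = set(a.split()), set(b.split())
jaccard = len(set_a & set_b) / len(set_a | set_b)

# Order-sensitive view: compares the two strings as character sequences.
seq_ratio = difflib.SequenceMatcher(None, a, b).ratio()

print(jaccard)    # 1.0: both sentences use exactly the same words
print(seq_ratio)  # below 1.0, because the word order differs
```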

This calculator gives you three ways to build the sets. Word mode splits on whitespace and punctuation, keeping Unicode letters, digits, and Han/Kana/Hangul runs as tokens. The 2-gram and 3-gram modes slide across the raw characters, so they stay meaningful for short snippets and for CJK text that has no spaces to split on. Flip the same pair through all three modes and you can watch how the overlap shifts between the word level and the character level on a single screen.
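The three modes can be sketched as one tokenizer with a switch. Everything here is an assumption for illustration: the function names, the mode strings, the `[^\W_]+` word pattern, and the choice to keep spaces inside character n-grams are not confirmed by the page.

```python
import re


def char_ngrams(text: str, n: int) -> set:
    # Slide a window of n characters across the raw text.
    # Assumption: spaces and punctuation stay in the window.
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def tokenize(text: str, mode: str = "word", case_insensitive: bool = True) -> set:
    if case_insensitive:
        text = text.casefold()
    if mode == "word":
        return set(re.findall(r"[^\W_]+", text))
    if mode == "2-gram":
        return char_ngrams(text, 2)
    if mode == "3-gram":
        return char_ngrams(text, 3)
    raise ValueError(f"unknown mode: {mode}")


def jaccard_index(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


a, b = "東京都の天気", "京都の天気"
for mode in ("word", "2-gram", "3-gram"):
    print(mode, jaccard_index(tokenize(a, mode), tokenize(b, mode)))
# word 0.0
# 2-gram 0.8
# 3-gram 0.75
```

With no spaces to split on, word mode sees two different single tokens and reports no overlap, while the character modes recover the shared substring.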

🧐 Frequently Asked Questions

How is Jaccard different from cosine similarity? Jaccard is set-based and only cares whether a token is present, which makes it precise and resistant to false positives on near-exact matches. Cosine similarity turns each text into a term-frequency vector and measures the angle between the two vectors, so it reflects how often each term appears and is more tolerant of frequency skew and partial rewording. A common rule of thumb: reach for Jaccard when you want precision, and cosine when you need recall that survives some rewording.
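The difference shows up when two texts share a vocabulary but use it in different proportions. The sketch below assumes raw term counts for the cosine vectors; the function names and the sample strings are invented for the example.

```python
import math
from collections import Counter


def jaccard_index(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine_similarity(x: Counter, y: Counter) -> float:
    # Dot product over shared terms, divided by the product of vector lengths.
    dot = sum(x[t] * y[t] for t in x.keys() & y.keys())
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0


a = "spam spam spam eggs".split()
b = "spam eggs eggs eggs".split()

print(jaccard_index(set(a), set(b)))              # 1.0: identical vocabulary
print(cosine_similarity(Counter(a), Counter(b)))  # about 0.6: counts differ
```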

Why does word mode return 1 token for my CJK text? Word mode splits on whitespace and punctuation, so text written without spaces collapses into one token per unbroken run, often a single token for the whole input. For Chinese, Japanese, or Korean, switch to 2-gram or 3-gram so the set is built character by character and the score becomes meaningful.

What does Jaccard distance tell me? Distance is 1 − index and reports how different the two texts are on a 0–1 scale. A higher index means more alike; a higher distance means further apart. When both sets are empty, the index is defined as 0.

Should I use 2-gram or 3-gram? For short inputs or one-to-two-word comparisons, 2-gram produces more elements and surfaces small differences. For longer text, 3-gram cuts noise and stays steadier. Trying both and watching the index move is the safest call.
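To see the trade-off on a short input, compare the same pair under both window sizes. This is an illustrative sketch with hypothetical helper names; the word pair is chosen only because it differs by one character.

```python
def char_ngrams(text: str, n: int) -> set:
    # Sliding window of n characters over the text.
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard_index(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


a, b = "color", "colour"
for n in (2, 3):
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    print(n, len(grams_a), len(grams_b), round(jaccard_index(grams_a, grams_b), 3))
# 2 4 5 0.5
# 3 3 4 0.4
```

The 2-gram sets are larger, so on a pair this short the single inserted letter costs less of the score than it does with 3-grams.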

📚 From botany to text mining

The coefficient comes from Swiss botanist Paul Jaccard, who built it to measure the share of plant species two regions hold in common. What started as an ecological tool for asking "how alike are the floras of two sites?" carries straight over to text once you treat a passage as a set of words or n-grams. The reason this single "shared over total" ratio shows up everywhere — from comparing genome sequences to matching user tastes in recommender systems — is exactly that elemental simplicity: it does not care what the elements are.