Finding near-duplicates with Jaccard similarity and MinHash

from blog Posts on Made of Bugs, 3 Jul 2024 | ↗ original

Suppose we have a large collection of documents, and we wish you identify which documents are approximately the same as each other. For instance, we may have crawled the web over some period of time, and expect to have fetched the “same page” several times, but to see slight differences in metadata, or that we have several revisions of a page...

This is a short summary. ↗ Open original to view full content

New benchmarks for approximate nearest neighbors