Finding near-duplicates with Jaccard similarity and MinHash

from blog Posts on Made of Bugs, | ↗ original
Suppose we have a large collection of documents, and we wish you identify which documents are approximately the same as each other. For instance, we may have crawled the web over some period of time, and expect to have fetched the “same page” several times, but to see slight differences in metadata, or that we have several revisions of a page...