Note: "8B pages" = 8 billion pages.
Title: Detecting Near-Duplicates for Web Crawling
WWW 2007, Track: Data Mining. Session: Similarity Search
Keywords: Hamming Distance, near-duplicate
Problem: eliminating near-duplicate web documents in a generic crawl has not received much attention. Exact duplicates can be caught with checksumming techniques; near-duplicates pose a harder problem, since such documents differ only in small portions that are irrelevant for web search, e.g. advertisements, counters, and timestamps.
Challenges: (1) billions of web pages; (2) the decision to mark a newly-crawled page as a near-duplicate of an existing page must be made quickly; (3) the system should use as few machines as possible.
Contributions: 1, show that Charikar’s simhash is useful for identifying near-duplicates in web documents belonging to a multi-billion-page repository;
2, develop a technique for solving the Hamming Distance Problem;
3, present a survey of algorithms and techniques for duplicate detection.
Charikar’s simhash: It is applied to web-pages as follows: we first convert a web-page into a set of features, each feature tagged with its weight. A set of weighted features constitutes a high-dimensional vector. With simhash, we can transform such a high-dimensional vector into an f-bit fingerprint where f is small, say 64.
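The weighted-feature-to-fingerprint transformation can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature hash function (MD5 here) and the example feature strings are my own choices.

```python
import hashlib

def simhash(features, f=64):
    """Compute an f-bit simhash from (feature, weight) pairs.

    Each feature is hashed to f bits. For every bit position we add the
    feature's weight if that bit is 1 and subtract it if it is 0; the
    sign of each accumulated component gives one bit of the fingerprint.
    """
    v = [0.0] * f
    for feature, weight in features:
        h = int(hashlib.md5(feature.encode()).hexdigest(), 16) & ((1 << f) - 1)
        for i in range(f):
            if (h >> i) & 1:
                v[i] += weight
            else:
                v[i] -= weight
    fingerprint = 0
    for i in range(f):
        if v[i] > 0:
            fingerprint |= 1 << i
    return fingerprint
```

Because each output bit is a weighted majority over all features, changing a few low-weight features flips only a few bits, which is exactly property (B) below.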
Properties of simhash: (A) The fingerprint of a document is a “hash” of its features, and (B) Similar documents have similar hash values. The latter property is atypical of hash functions like SHA-1 or MD5.
The Hamming distance problem: given a query fingerprint, find all stored fingerprints that differ from it in at most k bit positions. Naively probing every fingerprint obtained by flipping bits is prohibitively expensive: for 64-bit fingerprints and k = 3, we need C(64,3) = 41664 probes.
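The 41664 figure is just the binomial coefficient C(64,3), i.e. the number of ways to flip exactly 3 of 64 bits; a quick check, with a helper for Hamming distance between integer fingerprints:

```python
from math import comb

def hamming_distance(a, b):
    """Number of bit positions in which fingerprints a and b differ."""
    return bin(a ^ b).count("1")

# Flipping exactly 3 of the 64 bits of a query fingerprint yields
# C(64, 3) distinct probes -- already prohibitive on its own.
probes_exactly_3 = comb(64, 3)  # 41664
```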
Develop a practical algorithm: the problem can be solved with a small number of probes by duplicating the table of fingerprints by a small factor.
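The core idea can be illustrated with a pigeonhole argument; this is a simplified sketch, not the paper's exact scheme (the paper uses permuted, sorted tables of fingerprints). Split each 64-bit fingerprint into 4 blocks of 16 bits: if a fingerprint is within Hamming distance k = 3 of the query, the differing bits touch at most 3 blocks, so at least one block matches the query exactly. Keeping 4 copies of the data, each keyed on a different block, narrows the search to a handful of candidates per probe.

```python
BLOCKS = 4        # 4 copies of the table, one per 16-bit block
BLOCK_BITS = 16

def hamming(a, b):
    return bin(a ^ b).count("1")

def block(fp, i):
    """Extract the i-th 16-bit block of a 64-bit fingerprint."""
    return (fp >> (i * BLOCK_BITS)) & 0xFFFF

def build_tables(fingerprints):
    """One hash table per block position, keyed on that block's value."""
    tables = [dict() for _ in range(BLOCKS)]
    for fp in fingerprints:
        for i in range(BLOCKS):
            tables[i].setdefault(block(fp, i), []).append(fp)
    return tables

def near_duplicates(query, tables, k=3):
    """Probe each table on the query's corresponding block; by the
    pigeonhole argument, every fingerprint within distance k shares at
    least one block with the query, so only candidates from these four
    probes need to be verified with an exact distance check."""
    found = set()
    for i in range(BLOCKS):
        for fp in tables[i].get(block(query, i), []):
            if fp not in found and hamming(query, fp) <= k:
                found.add(fp)
    return found
```

The space blow-up is the small duplication factor (4x here); the paper explores the general trade-off between the number of tables and the number of probes per table.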
Duplicate detection, a survey: A variety of techniques have been developed to identify pairs of documents that are “similar” to each other. These differ in terms of the end goal, the corpus under consideration, the feature-set identified per document and the signature scheme for compressing the feature-set.
[references]:
1. About the simhash algorithm: http://hi.baidu.com/xun1573/item/595e49c0be74b32fee466597
2. G. S. Manku, A. Jain, A. Das Sarma. Detecting Near-Duplicates for Web Crawling. WWW 2007.