Detecting similar Strings or Sentences are very different from similar Files or Documents
There are two main categories of algorithms for matching string similarity: equivalence methods and similarity ranking methods.
Equivalence methods compare two strings and return a value of true or false depending on whether the two strings look equivalent. For example, ‘general-purpose’ is equivalent to ‘general purpose’.
Word Stemming: It is a technique to reduce closely related words to a basic canonical form or stem. For example ‘runs’, and ‘running’ can be reduced to ‘run’ – the basic stem for these words, before the algorithm begins performing an exact match. Similarly, ‘’familiarise’ and ‘familiarize’ are the same words in different dialects. Read more on stemming on Wikipedia; and you can download open-source porter-stemmers for many of the languages from Snowball.
Synonyms/Abbreviations: A technique to reduce/expand common abbreviations and replacing words that may be common synonyms of others. For example, ‘ads is very easily interchangeably used for ‘advertisements’. Doing so close to the user interface allows the back-end implementations to work on a smaller-controlled vocabulary, resulting in faster retrieval systems.
Wildcards & Regular Expressions: The popularity of wildcards and regular expressions (sounds a little geeky), has gained entrance in the field of string similarity. In this case, the input ‘go fish’ would become ‘*go* *fish*’, which matches ‘gone fishing’ as well as ‘go fishing’.
Soundex Algorithm: A phonetic algorithm for indexing names by sound, as pronounced in English. For example, ‘tame’, ‘fame’, and ‘’name’ all sound the same; but are not the same words. On the other hand, ‘licence’ and ‘license’ are. It is one of the most widely known phonetic algorithms, being a standard part of MySQL and ORACLE.
Similarity ranking methods compare a given string to a given set of other strings and then rank the given set in the order of their similarity with the original string. Ranking is based on how one string matches better than the other string.
Longest Common Substring: To find the longest string that is a substring of both the strings.
Edit Distance: The idea is to account for common character omissions, insertions, substitutions and reversals. The algorithms compute the number of string operations that are needed to convert one string in to another. For example, to convert ‘writer’ to ‘water’ two operations are needed (one replacement and one deletion) and hence, the edit distance will be 2.
Hamming Distance: It is the number of character positions in which the characters of two given strings are different.
Simon White’s Algorithm: Simon White, in one of his articles has demonstrated an algorithm to compute the similarity of strings and rank results based on it. The algorithm fares better than algorithms mentioned above in individuality. To quote from the article,
The algorithm was driven by the following requirements:
a). A true reflection of lexical similarity – strings with small differences should be recognized as being similar. In particular, a significant substring overlap should point to a high level of similarity between the strings.
b). A robustness to changes of word order – two strings which contain the same words, but in a different order, should be recognized as being similar. On the other hand, if one string is just a random anagram of the characters contained in the other, then it should (usually) be recognized as dissimilar.
c). Language Independence – the algorithm should work not only in English, but in many different languages.
About Simon White’s Algorithm – A Java Implementation: 2-shinling? W-shingling
http://www.devarticles.com/c/a/Development-Cycles/How-to-Strike-a-Match/ Note that whenever a match is found, that character pair is removed from the second array list to prevent us from matching against the same character pair multiple times. (Otherwise, ‘GGGGG’ would score a perfect match against ‘GG’.)
As detailed in the project, SimMetrics, there are way too many algorithms to determine distance metrics between two strings. SimMetrics is an open-source implementation (in Java) of some of the most popular similarity or distance metrics algorithms. This area has so well been explored that the best algorithms have already been developed, and I embark on the journey in discovering them 😉
A typical approach is to employ a two-pass mechanism in which an equivalence method is used by the database as a first pass filter, and a ranked similarity method is applied to the filtered entries for the second pass. Ranked similarity methods tend to be algorithmically more complex than equivalence methods, so are usually implemented as custom code outside of the database.
Lastly, you might imagine that this area of computing has been so well explored that the best algorithms have already been found and are well-known. However, it is still a research area and I also wouldn’t be at all surprised if the teams at Google are working on novel approximate string-matching algorithms right now!