Algorithm for indexing of words by their pronunciation
A phonetic algorithm is an algorithm for indexing words by their pronunciation. Most phonetic algorithms were developed for English and are not useful for indexing words in other languages.[1] Because English spelling varies significantly with factors such as a word's origin, its usage over time, and borrowings from other languages, phonetic algorithms necessarily take into account numerous rules and exceptions.[2]
Soundex was developed to encode surnames for use in censuses. Soundex codes are four-character strings composed of a single letter followed by three digits.
Daitch–Mokotoff Soundex is a refinement of Soundex designed to better match surnames of Slavic and Germanic origin. Daitch–Mokotoff Soundex codes are strings composed of six numeric digits.
Cologne phonetics is similar to Soundex but more suitable for German words.
Metaphone and Double Metaphone are suitable for use with most English words, not just names. Metaphone algorithms are the basis for many popular spell checkers.
The Match Rating Approach, developed by Western Airlines in 1977, combines an encoding technique with a range comparison technique.
Caverphone was created to assist in data matching between late 19th-century and early 20th-century electoral rolls and is optimized for accents present in parts of New Zealand.
Spell checkers often contain phonetic algorithms. The Metaphone algorithm, for example, can take an incorrectly spelled word and create a code. The code is then looked up in a directory of words with the same or a similar Metaphone code. Words that have the same or a similar Metaphone code become possible alternative spellings.
Search functionality often uses phonetic algorithms to find results that do not exactly match the search terms. Searching for names can be difficult because many names have several alternative spellings. The name Claire, for example, has two alternative spellings, Clare and Clair, and all three are pronounced the same. An exact-match search for one spelling would not return results for the other two. Under Soundex, all three spellings produce the same code, C460, so a search keyed on the Soundex code returns all three variations.
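The Claire/Clare/Clair lookup can be sketched in Python. This is a minimal illustration that assumes the common American Soundex rules (vowels separate repeated codes, H and W do not); the function names `soundex`, `build_index`, and `search` are hypothetical and not taken from any particular library.

```python
from collections import defaultdict

# Consonant groups of American Soundex; vowels, H, W and Y have no digit.
_GROUPS = (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
           ("L", "4"), ("MN", "5"), ("R", "6"))
_CODES = {ch: digit for letters, digit in _GROUPS for ch in letters}


def soundex(name: str) -> str:
    """Return a four-character Soundex code (one letter, three digits)."""
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = _CODES.get(first, "")  # the first letter's own code also counts
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W are skipped and do not split a run of codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code is kept again
    # Pad with zeros, then truncate to one letter plus three digits.
    return (first + "".join(digits) + "000")[:4]


def build_index(names):
    """Group names by Soundex code (assumed in-memory index)."""
    index = defaultdict(list)
    for name in names:
        index[soundex(name)].append(name)
    return index


def search(index, query):
    """Return every indexed name that shares the query's Soundex code."""
    return index.get(soundex(query), [])


if __name__ == "__main__":
    index = build_index(["Claire", "Clare", "Clair", "Robert"])
    print(soundex("Claire"), search(index, "Clare"))
```

Keying the index on the code rather than on the spelling is what lets a query for any one variant retrieve the others; the trade-off is that unrelated names sharing a code are also returned and may need filtering.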
Data deduplication efforts use phonetic algorithms to bucket records into groups of similar-sounding names for further evaluation.
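The bucketing step might look like the following Python sketch. The key function `crude_key` is a deliberately simplified stand-in, not a published phonetic algorithm; a real pipeline would pass in Soundex, Metaphone, or a similar encoder. The record layout and the names `bucket_records` and `candidate_pairs` are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def crude_key(name: str) -> str:
    """Stand-in phonetic key (NOT a published algorithm).

    Keeps the first letter, drops later vowels plus H, W and Y, and skips
    a consonant that repeats the previously kept letter.
    """
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    key = [letters[0]]
    for ch in letters[1:]:
        if ch in "AEIOUYHW":
            continue
        if ch != key[-1]:
            key.append(ch)
    return "".join(key)


def bucket_records(records, key_fn=crude_key):
    """Group records (dicts with a 'name' field) by their phonetic key."""
    buckets = defaultdict(list)
    for record in records:
        buckets[key_fn(record["name"])].append(record)
    return buckets


def candidate_pairs(buckets):
    """Yield pairs of records that share a bucket, for further evaluation."""
    for group in buckets.values():
        yield from combinations(group, 2)


if __name__ == "__main__":
    records = [
        {"id": 1, "name": "Smith"},
        {"id": 2, "name": "Smyth"},
        {"id": 3, "name": "Jones"},
        {"id": 4, "name": "Smithe"},
    ]
    for a, b in candidate_pairs(bucket_records(records)):
        print(a["id"], b["id"])
```

Only records within the same bucket are compared, so the expensive pairwise evaluation runs on a few small groups instead of on every pair in the data set.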
Speech-to-text modules use phonetic encoding to find the set of dictionary words that are pronounced similarly to the phonemes extracted from the processed audio signal.