c# .net regex string soundex

Implement soundex in .Net for entire sentences


I have a regex-based, Soundex-style method:

public static string SoundEx(string word)
{
    if (word.All(char.IsDigit))
    {
        //sentenceParts = words;
        return word;
    }
    word = word.ToUpper();
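    // Keep the first character, then encode the rest: drop vowels and H/W/Y,
    // and collapse each run of same-class consonants into a single digit.
    string rest = word.Substring(1);
    rest = Regex.Replace(rest, "[AEIOUYHW]", "");
    rest = Regex.Replace(rest, "[BFPV]+", "1");
    rest = Regex.Replace(rest, "[CGJKQSXZ]+", "2");
    rest = Regex.Replace(rest, "[DT]+", "3");
    rest = Regex.Replace(rest, "[L]+", "4");
    rest = Regex.Replace(rest, "[MN]+", "5");
    rest = Regex.Replace(rest, "[R]+", "6");
    word = word[0] + rest;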

    return word;//word.PadRight(4, '0').Substring(0, 4);
}

This works fine on single-word strings, but as soon as I pass in a sentence it starts acting up. For example,

"The big brown cat." and "The big brown dog."

come up as a match. I understand that it keeps the first character of the first word and then uses the regexes to drop the vowels and map the remaining consonants to digits. But how can I apply this to an entire sentence and make it more accurate?


Solution

  • You have to Soundex each word separately. That turns the sentence into a set of four-character codes instead of a string of characters. You then compare the sets against each other.

    So your example becomes "T000 B200 B650 C300" vs. "T000 B200 B650 D200".

    I would recommend using the Double Metaphone algorithm instead of Soundex, as it is much better. It also does not rely on the first letter staying the same. Soundex does, which is why it cannot match words like Fishing (F252) and Phishing (P252).
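
The per-word approach described above could be sketched roughly as follows. This is a minimal illustration, assuming classic American Soundex rules (H and W do not separate letters with the same code, vowels do) and a simple Jaccard overlap for comparing the two code sets. The names SentenceSoundex, EncodeWord, Encode and Similarity are hypothetical, and the overlap measure is just one possible choice, not something the answer prescribes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper: Soundex-encodes each word of a sentence separately.
public static class SentenceSoundex
{
    // Soundex digit for an upper-case letter; '0' means "not coded" (vowels, H, W, Y).
    private static char Digit(char c)
    {
        switch (c)
        {
            case 'B': case 'F': case 'P': case 'V': return '1';
            case 'C': case 'G': case 'J': case 'K':
            case 'Q': case 'S': case 'X': case 'Z': return '2';
            case 'D': case 'T': return '3';
            case 'L': return '4';
            case 'M': case 'N': return '5';
            case 'R': return '6';
            default: return '0';
        }
    }

    // Encodes one word as a four-character code such as "B650".
    public static string EncodeWord(string word)
    {
        string letters = new string(
            word.ToUpperInvariant().Where(c => c >= 'A' && c <= 'Z').ToArray());

        // Assumption: tokens without letters (e.g. numbers) pass through
        // unchanged, mirroring the digit check in the question's method.
        if (letters.Length == 0) return word;

        var code = new StringBuilder();
        code.Append(letters[0]);
        char previous = Digit(letters[0]);

        for (int i = 1; i < letters.Length && code.Length < 4; i++)
        {
            char c = letters[i];
            char d = Digit(c);
            if (d != '0' && d != previous) code.Append(d);
            // H and W are transparent; any other letter updates the last code seen.
            if (c != 'H' && c != 'W') previous = d;
        }
        return code.ToString().PadRight(4, '0');
    }

    // Splits on anything that is not an ASCII letter or digit, then encodes each word.
    public static string[] Encode(string sentence)
    {
        return Regex.Split(sentence, "[^A-Za-z0-9]+")
                    .Where(w => w.Length > 0)
                    .Select(w => EncodeWord(w))
                    .ToArray();
    }

    // Jaccard overlap of the two code sets: shared codes / distinct codes overall.
    public static double Similarity(string a, string b)
    {
        var left = new HashSet<string>(Encode(a));
        var right = new HashSet<string>(Encode(b));
        if (left.Count == 0 && right.Count == 0) return 1.0;

        int shared = left.Count(code => right.Contains(code));
        int total = left.Count + right.Count - shared;
        return (double)shared / total;
    }
}
```

With this sketch, SentenceSoundex.Encode("The big brown cat.") yields T000, B200, B650, C300, and Similarity for the cat and dog sentences comes out at 0.6 (three shared codes out of five distinct ones), so the two sentences are close but no longer an exact match. Where to set the threshold for "similar enough" is left to the caller.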