Search code examples
c#levenshtein-distancesynonym

Synonyms for words by Levenshtein Distance


This is my code:

public void SearchWordSynonymsByLevenstein()
{
    foreach (var eachWord in wordCounter)
    {
        foreach (var eachSecondWord in wordCounter)
        {
            if (eachWord.Key.Length > 3)
            {
                var score = LevenshteinDistance.Compute(eachWord.Key, eachSecondWord.Key);
                if (score < 2)
                {
                    if(!wordSynonymsByLevenstein.Any(x => x.Value.ContainsKey(eachSecondWord.Key)))
                    {
                        if (!wordSynonymsByLevenstein.ContainsKey(eachWord.Key))
                        {
                            wordSynonymsByLevenstein.Add(eachWord.Key, new Dictionary<string, int> { { eachSecondWord.Key, eachSecondWord.Value } });
                        }
                        else
                        {
                            wordSynonymsByLevenstein[eachWord.Key].Add(eachSecondWord.Key, eachSecondWord.Value);
                        }
                    }
                }
            }
        }
    }
}

My wordCounter is Dictionary<string, int> where key is my each word and value is count how many this word exists in documents. Something like Bag of word. I must search synonyms for eachWord from other eachSecondWord. This method cost too much time. Time increases exponentially. There is other way to reduce the time ?


Solution

  • First off I assume you don't want to associate a word with itself in the wordSynonymsByLevenstein collection. Second you can skip ones that you know will not meet your < 2 score requirement by comparing the lengths of the words.

    public void SearchWordSynonymsByLevenstein()
    {
        foreach (var eachWord in wordCounter)
        {
            foreach (var eachSecondWord in wordCounter)
            {
                if (eachWord.Key == eachSecondWord.Key 
                    || eachWord.Key.Length <= 3 
                    || Math.Abs(eachWord.Key.Length - eachSecondWord.Key.Length) >= 2)
                {
                    continue;
                }
                var score = LevenshteinDistance.Compute(eachWord.Key, eachSecondWord.Key);
                if (score >= 2)
                {
                    continue;
                }
    
                if(!wordSynonymsByLevenstein.Any(x => x.Value.ContainsKey(eachSecondWord.Key)))
                {
                    if (!wordSynonymsByLevenstein.ContainsKey(eachWord.Key))
                    {
                        wordSynonymsByLevenstein.Add(eachWord.Key, new Dictionary<string, int> { { eachSecondWord.Key, eachSecondWord.Value } });
                    }
                    else
                    {
                        wordSynonymsByLevenstein[eachWord.Key].Add(eachSecondWord.Key, eachSecondWord.Value);
                    }
                }
    
            }
        }
    }
    

    Your requirement that is expressed with the if(!wordSynonymsByLevenstein.Any(x => x.Value.ContainsKey(eachSecondWord.Key))) is not particularly obvious or straight forward, but if you don't want a word associated with more than one, then you could additionally add a HashSet<string> and as you associate words add them to that HashSet and check if the next word is in there before continuing, instead of iterating the nested dictionaries.

    public void SearchWordSynonymsByLevenstein()
    {
        var used = new HashSet<string>();
        foreach (var eachWord in wordCounter)
        {
            foreach (var eachSecondWord in wordCounter)
            {
                if (eachWord.Key == eachSecondWord.Key 
                    || eachWord.Key.Length <= 3 
                    || Math.Abs(eachWord.Key.Length - eachSecondWord.Key.Length) >= 2)
                {
                    continue;
                }
                var score = LevenshteinDistance.Compute(eachWord.Key, eachSecondWord.Key);
                if (score >= 2)
                {
                    continue;
                }
    
                if(used.Add(eachSecondWord.Key)))
                {
                    if (!wordSynonymsByLevenstein.ContainsKey(eachWord.Key))
                    {
                        wordSynonymsByLevenstein.Add(eachWord.Key, new Dictionary<string, int> { { eachSecondWord.Key, eachSecondWord.Value } });
                    }
                    else
                    {
                        wordSynonymsByLevenstein[eachWord.Key].Add(eachSecondWord.Key, eachSecondWord.Value);
                    }
                }
    
            }
        }
    }
    

    Here I used if(used.Add(eachSecondWord.Key))) because Add will return true if the word was added and false if it was already in the HashSet.