This is my code:
public void SearchWordSynonymsByLevenstein()
{
foreach (var eachWord in wordCounter)
{
foreach (var eachSecondWord in wordCounter)
{
if (eachWord.Key.Length > 3)
{
var score = LevenshteinDistance.Compute(eachWord.Key, eachSecondWord.Key);
if (score < 2)
{
if(!wordSynonymsByLevenstein.Any(x => x.Value.ContainsKey(eachSecondWord.Key)))
{
if (!wordSynonymsByLevenstein.ContainsKey(eachWord.Key))
{
wordSynonymsByLevenstein.Add(eachWord.Key, new Dictionary<string, int> { { eachSecondWord.Key, eachSecondWord.Value } });
}
else
{
wordSynonymsByLevenstein[eachWord.Key].Add(eachSecondWord.Key, eachSecondWord.Value);
}
}
}
}
}
}
}
My wordCounter
is Dictionary<string, int>
where key is my each word and value is count how many this word exists in documents. Something like Bag of word. I must search synonyms for eachWord
from other eachSecondWord
. This method cost too much time. Time increases exponentially. There is other way to reduce the time ?
First off I assume you don't want to associate a word with itself in the wordSynonymsByLevenstein
collection. Second you can skip ones that you know will not meet your < 2 score requirement by comparing the lengths of the words.
public void SearchWordSynonymsByLevenstein()
{
foreach (var eachWord in wordCounter)
{
foreach (var eachSecondWord in wordCounter)
{
if (eachWord.Key == eachSecondWord.Key
|| eachWord.Key.Length <= 3
|| Math.Abs(eachWord.Key.Length - eachSecondWord.Key.Length) >= 2)
{
continue;
}
var score = LevenshteinDistance.Compute(eachWord.Key, eachSecondWord.Key);
if (score >= 2)
{
continue;
}
if(!wordSynonymsByLevenstein.Any(x => x.Value.ContainsKey(eachSecondWord.Key)))
{
if (!wordSynonymsByLevenstein.ContainsKey(eachWord.Key))
{
wordSynonymsByLevenstein.Add(eachWord.Key, new Dictionary<string, int> { { eachSecondWord.Key, eachSecondWord.Value } });
}
else
{
wordSynonymsByLevenstein[eachWord.Key].Add(eachSecondWord.Key, eachSecondWord.Value);
}
}
}
}
}
Your requirement that is expressed with the if(!wordSynonymsByLevenstein.Any(x => x.Value.ContainsKey(eachSecondWord.Key)))
is not particularly obvious or straight forward, but if you don't want a word associated with more than one, then you could additionally add a HashSet<string>
and as you associate words add them to that HashSet
and check if the next word is in there before continuing, instead of iterating the nested dictionaries.
public void SearchWordSynonymsByLevenstein()
{
var used = new HashSet<string>();
foreach (var eachWord in wordCounter)
{
foreach (var eachSecondWord in wordCounter)
{
if (eachWord.Key == eachSecondWord.Key
|| eachWord.Key.Length <= 3
|| Math.Abs(eachWord.Key.Length - eachSecondWord.Key.Length) >= 2)
{
continue;
}
var score = LevenshteinDistance.Compute(eachWord.Key, eachSecondWord.Key);
if (score >= 2)
{
continue;
}
if(used.Add(eachSecondWord.Key)))
{
if (!wordSynonymsByLevenstein.ContainsKey(eachWord.Key))
{
wordSynonymsByLevenstein.Add(eachWord.Key, new Dictionary<string, int> { { eachSecondWord.Key, eachSecondWord.Value } });
}
else
{
wordSynonymsByLevenstein[eachWord.Key].Add(eachSecondWord.Key, eachSecondWord.Value);
}
}
}
}
}
Here I used if(used.Add(eachSecondWord.Key)))
because Add
will return true
if the word was added and false
if it was already in the HashSet
.