I am out of ideas on how to complete this task. I am counting word frequencies, or more precisely the frequency of each word's base form (e.g. "running" is counted as "run"). I looked at some implementations of Levenshtein distance (one implementation I ran into is from dotnetperls).
I also tried the double Metaphone, but it isn't what I'm looking for.
So, please give me some ideas on how to tweak the Levenshtein distance algorithm so that it classifies linguistically similar words. As it stands, the algorithm only determines the number of edits needed to turn one string into another; it does not consider whether the two words are linguistically related.
Example:
1. "running" will be counted as one occurrence of the word "run"
2. "word" will likewise be an occurrence of "word"
3. "fear" will NOT be counted as an occurrence of "gear"
Also, I am implementing it in C#.
Thanks in advance.
Edit: I edited it as Rene suggested. Another note: I am considering checking whether one word is a substring of another, but that approach would not be very flexible. Another idea is: "if adding -s or -ing to string1 makes it equal to string2, then string2 is an occurrence of string1." However, this does not always hold, as some words have irregular plurals.
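To make that suffix idea concrete, here is a minimal sketch. The class and method names (`NaiveSuffixMatch`, `IsOccurrenceOf`) are made up for illustration only. It handles regular cases such as "words", but misses "running" (the consonant is doubled) and irregular forms such as "mice".

```csharp
using System;

static class NaiveSuffixMatch
{
    // Hypothetical helper: true when candidate is baseWord itself,
    // or baseWord with "s" or "ing" appended.
    public static bool IsOccurrenceOf(string baseWord, string candidate)
    {
        return candidate == baseWord
            || candidate == baseWord + "s"
            || candidate == baseWord + "ing";
    }
}
```

For example, `IsOccurrenceOf("word", "words")` is true and `IsOccurrenceOf("fear", "gear")` is false, as wanted, but `IsOccurrenceOf("run", "running")` and `IsOccurrenceOf("mouse", "mice")` are both false.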
The task you are trying to solve is called Stemming or Lemmatisation.
As you figured out already, Levenshtein distance is not the way to go here. Common stemming algorithms for English include the Porter stemmer and the Snowball stemmer. If you google for those, I'm sure you will find a C# implementation of one of them.
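To give a rough idea of what a stemmer does, here is a deliberately crude suffix-stripping sketch in C#. It is not the Porter or Snowball algorithm, only a toy loosely inspired by a couple of their rules, and the names `StemCounter`, `Stem` and `CountStems` are invented for this example. For real text, use a proper implementation of one of the algorithms above.

```csharp
using System;
using System.Collections.Generic;

static class StemCounter
{
    private static readonly char[] Vowels = { 'a', 'e', 'i', 'o', 'u' };

    // Toy stemmer (an assumption for illustration, NOT Porter/Snowball):
    // strips "-ing" and a plural "-s", nothing else.
    public static string Stem(string word)
    {
        string w = word.ToLowerInvariant();

        if (w.EndsWith("ing", StringComparison.Ordinal))
        {
            string candidate = w.Substring(0, w.Length - 3);

            // Only strip when a vowel remains, so "sing" and "string" stay intact.
            if (candidate.IndexOfAny(Vowels) < 0)
                return w;

            // Undo consonant doubling ("runn" -> "run"), but keep a doubled
            // l, s or z ("fall", "miss", "buzz").
            int n = candidate.Length;
            if (n >= 2
                && candidate[n - 1] == candidate[n - 2]
                && "aeioulsz".IndexOf(candidate[n - 1]) < 0)
            {
                candidate = candidate.Substring(0, n - 1);
            }
            return candidate;
        }

        // Strip a trailing "s" from longer words, but not from "-ss" endings.
        if (w.Length > 3
            && w.EndsWith("s", StringComparison.Ordinal)
            && !w.EndsWith("ss", StringComparison.Ordinal))
        {
            return w.Substring(0, w.Length - 1);
        }

        return w;
    }

    // Counts how often each stem occurs in the given words.
    public static Dictionary<string, int> CountStems(IEnumerable<string> words)
    {
        var counts = new Dictionary<string, int>();
        foreach (string word in words)
        {
            string stem = Stem(word);
            counts.TryGetValue(stem, out int current); // current is 0 if absent
            counts[stem] = current + 1;
        }
        return counts;
    }
}
```

Calling `CountStems` on "running", "run", "word", "words", "fear", "gear" gives run: 2, word: 2, fear: 1, gear: 1, which matches the three examples in the question. The sketch still gets plenty of words wrong (for instance "this" becomes "thi", and irregular forms such as "mice" are untouched), which is exactly what the full algorithms, or a dictionary-based lemmatiser, are for.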