Search code examples
c#nlpsimilaritylemmatization

Analyze text (lemmatization, edit distance)


I need to analyze the text to exist in it banned words. Suppose the black list is the word: "Forbid". The word has many forms. In the text the word can be, for example: "forbidding", "forbidden", "forbad". To bring the word to the initial form, I use a process lemmatization. Your suggestions?

What about typos?
For example: "F0rb1d". I think use damerau–Levenshtein or another. You suggestions?

And what if the text is written as follows:
"ForbiddenInformation.Privatecorrespondenceofthecompany." OR "F0rb1dden1nformation.Privatecorresp0ndenceofthec0mpany." (yes, without whitespace)

How to solve this problem?
Preferably fast algorithm, because text are processed in real time.
And maybe what some tips to improve performance (how to store, etc)?


Solution

  • there're two possible solutions as far as I know algorithms.

    You could try to use dynamic programming , LCS (longest common subsequence). It will search original text for the desired word as pattern, I believe it's O(mn):

    http://en.wikipedia.org/wiki/Longest_common_subsequence_problem http://www.ics.uci.edu/~eppstein/161/960229.html

    Although the easier would be to use text search algorithm. The best I know is KMP and it's O(n). For character comparison you could group them into sets like {i I l(L) 1}, {o O 0} and so on. Yet you could modify this for not matching all letters (forbid -> forbad).

    http://en.wikipedia.org/wiki/Knuth-Morris-Pratt_algorithm

    So now you could compare benefits of these two and yours suggestion.