string machine-learning nlp nltk stemming

How to remove unnecessary words from a string for better search


I have various strings that I use to search for related data, but because they contain unnecessary words, the retrieved results are poor. For example, in "Working of genetic algorithm", the words "working of" are not important. I can remove "of" by treating it as a stop word, but what about "working"? I could apply stemming, but that would only strip the "ing", which doesn't solve the problem. Similarly, in another string, "Determination of.....", I consider the other words in the string important and "Determination of" unimportant, so I want to remove those words before proceeding further. Any ideas or hints on how I can remove such words? There are a lot of them, so I cannot hardcode the list.


Solution

  • Instead of removing such terms, I would suggest focusing on n-grams. By splitting the query into n-grams you can generate several candidate search strings (for example, "genetic algorithm" from "Working of genetic algorithm"), which may help you find related information more effectively. How many words to combine, i.e. bigrams or trigrams, is up to you. You can do this with the Python nltk library.
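
    To make the n-gram idea concrete, here is a minimal sketch in plain Python. It assumes simple whitespace tokenization and lowercasing, and the helper name `word_ngrams` is made up for illustration. If you have nltk installed, `nltk.util.ngrams(tokens, n)` should give you the same word sequences as tuples.

```python
def word_ngrams(text, n):
    """Return every contiguous run of n words in text, joined back into strings."""
    # Assumption: splitting on whitespace is good enough for short search queries.
    tokens = text.lower().split()
    # Slide a window of n tokens across the query; too-short queries yield [].
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


query = "Working of genetic algorithm"

bigrams = word_ngrams(query, 2)
trigrams = word_ngrams(query, 3)

print(bigrams)
print(trigrams)
```

    Each n-gram can then be sent to the search backend as its own query. The bigram "genetic algorithm" appears among the candidates without "working" or "of" ever being listed as words to remove, although you would still need some way of ranking or merging the results from the different candidates.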