Tags: language-agnostic, indexing, filtering, stop-words, nlp

"Stop words" list for English?


I'm generating some statistics for some English-language text and I would like to skip uninteresting words such as "a" and "the".

  • Where can I find some lists of these uninteresting words?
  • Is a list of these words the same as a list of the most frequently used words in English?

Update: these are apparently called "stop words", not "skip words".


Solution

  • The magic word to put into Google is "stop words". This turns up a reasonable-looking list.

    MySQL also has a built-in list of stop words, but it is far too comprehensive for my taste. For example, at our university library we ran into problems because "third" in "third world" was treated as a stop word.
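
    Given the "third world" problem, it may be safer to start from a short list and extend it only as needed. Here is a minimal Python sketch of counting words while skipping stop words. The STOP_WORDS set and the word_counts helper are made up for illustration; they are not taken from MySQL or any published list.

```python
import re
from collections import Counter

# Hypothetical, deliberately short stop-word set (an assumption for this
# example). Extend or trim it to suit your own text.
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "in", "to", "is", "was"})

def word_counts(text, stop_words=STOP_WORDS):
    """Count lowercase words in text, skipping any found in stop_words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stop_words)

counts = word_counts("The third world and the first world")
print(counts["world"], counts["third"], counts["the"])  # prints: 2 1 0
```

    Because "third" is not in this short set, "third world" survives the filtering intact. Keeping the list as a plain set also makes it easy to review exactly which words are being dropped.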