information-retrieval lemur

Indexing collections with stopword removal in Galago


I successfully indexed a collection using Galago, but I didn't find any parameter for removing stopwords during indexing. Does Galago remove stopwords automatically? If not, how can I pass a stopword list to Galago and tell it to remove those stopwords?


Solution

  • Galago, as a research search engine, tries not to make assumptions that can't be taken back: by default, indexes are built for stemmed and unstemmed terms.

    No stopwords are removed at index time. This puts the burden on you at query time, but it lets you change or tune stopword lists on a training set without rebuilding the index.

    If you want stopword removal, it needs to be a query-time step. If you think about it, this is what any modern search engine wants unless it is cramped for disk space: a query such as "to be or not to be" is unanswerable if stopwords were dropped from the index, short of more sophisticated techniques. It is better to write some code that removes stopwords unless doing so would empty the query than to remove them unconditionally.

    Galago provides access to the "inquery" stopword list through the WordLists class.
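
The "remove stopwords unless that empties the query" rule described above can be sketched roughly as follows. This is a minimal illustration rather than Galago's own code: the class name `QueryStopper`, the method `removeStopwords`, and the tiny `STOPWORDS` set are all hypothetical, and the sketch assumes the query has already been split into terms. In a real setup the set would presumably be filled from the "inquery" list that Galago exposes through WordLists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class QueryStopper {
    // Hypothetical, tiny stopword list for illustration only; with Galago the
    // "inquery" list obtained via the WordLists class would be used instead.
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList(
        "a", "an", "and", "be", "in", "not", "of", "or", "the", "to"));

    // Drops stopwords from the query terms, but falls back to the original
    // terms when every term is a stopword, so the query is never emptied.
    static List<String> removeStopwords(List<String> terms, Set<String> stopwords) {
        List<String> kept = new ArrayList<>();
        for (String term : terms) {
            // Compare case-insensitively; the stopword set is lowercase.
            if (!stopwords.contains(term.toLowerCase(Locale.ROOT))) {
                kept.add(term);
            }
        }
        return kept.isEmpty() ? new ArrayList<>(terms) : kept;
    }

    public static void main(String[] args) {
        // Every term is a stopword, so the query is kept unchanged.
        System.out.println(removeStopwords(
            Arrays.asList("to", "be", "or", "not", "to", "be"), STOPWORDS));
        // Mixed query: only the stopwords are dropped.
        System.out.println(removeStopwords(
            Arrays.asList("the", "history", "of", "hamlet"), STOPWORDS));
    }
}
```

With this sample list, the first query is passed through untouched and the second is reduced to `history hamlet`. The cleaned terms would then be joined back into the query string before it is sent to the index.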