stop-words, topic-modeling, mallet

Mallet - Topic Modeling - Stopwords Error


Although I supply an extra stopword list in addition to the default stopword list when I use MALLET for topic modeling, some stop words still appear in the topic models, for example "ın", "ıf", "ıt". How do I ensure that these stopwords don't appear in the topic models? The topic models are below.

0 5 ı ıt time room door house people eyes thing night woman day make girl face mother voice car home

1 5 ıt ın fact sense point experience order form human action common general religious law part change number case evidence

2 5 time place work water long make cut ın square large top house side built machine building clay piece design

3 5 school people ın development national american members social program system economic group problems education class students work policy children

4 5 year york week home music american city house president day school club william show white ın days family night

5 5 ıt time fire feet river long road side miles game land run hit war gun big ball began arms

6 5 hands water white hand ın black food eyes face slowly sun cold ıt life red head hot long body

7 5 ın number system data surface temperature high low type volume information material pressure feed form small results shown method

8 5 world life church god war time great death book english ın century history england french west soviet love spirit

9 5 state year united government general business federal department court tax cost million company secretary act public ın service industry

Thanks for any advice.


Solution

  • Check the spelling of your stopwords. Mallet lowercases your corpus by default, but it does not lowercase your stopwords!

    Also check the format of your stopword file: Mallet expects it to be one-word-per-line.

    And don't forget to add the option --stoplist-file yourstopwordfile.txt to the mallet import-dir command.

    EDIT: Beware of OCR errors in your input file: I see that in the topics, words like "ın" are spelled with a dotless i (as used in Turkish orthography), not with the usual dotted i. So either apply some OCR correction before topic modelling, or add the dotless-i spellings ("ın", "ıf", "ıt") as additional stopwords.

    EDIT2: There is another possible source for the dotless-i "ın", "ıf", "ıt": Mallet lowercases all words in the corpus. When your locale is set to Turkish, Java lowercases a capital I to a dotless i, so "In" becomes "ın" and no longer matches the stopword "in". Check your Java locale (language) settings and create the topic model again from scratch.
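
To see the locale effect for yourself, here is a minimal Java sketch. It is not Mallet code; the class name DotlessIDemo and both helper methods are made up for illustration. It lowercases a word under a Turkish locale and under the locale-neutral Locale.ROOT, and it also shows one way to build the dotless-i variants of a stopword list if you prefer the "additional stopwords" workaround. The dotless i is written as the escape \u0131 so the source file does not depend on its encoding.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class DotlessIDemo {

    // Lowercase with an explicit locale, so the result does not depend on
    // the JVM's default locale.
    static String lower(String word, Locale locale) {
        return word.toLowerCase(locale);
    }

    // Hypothetical helper: normalize each stopword (trim, lowercase with
    // Locale.ROOT) and, if it contains a dotted 'i', also emit a variant
    // spelled with dotless i (U+0131). Blank entries are skipped.
    static List<String> withDotlessVariants(List<String> stopwords) {
        List<String> out = new ArrayList<>();
        for (String raw : stopwords) {
            String word = raw.trim().toLowerCase(Locale.ROOT);
            if (word.isEmpty()) {
                continue;
            }
            out.add(word);
            if (word.indexOf('i') >= 0) {
                out.add(word.replace('i', '\u0131'));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");

        // Shows which locale this JVM uses when none is given explicitly.
        System.out.println("default locale: " + Locale.getDefault());

        // Turkish rules map capital I to dotless i; Locale.ROOT maps it to dotted i.
        System.out.println("tr:   " + lower("In", turkish));
        System.out.println("root: " + lower("In", Locale.ROOT));

        // One word per line, ready to be saved as an extra stopword file.
        for (String w : withDotlessVariants(Arrays.asList("In", "If", "It", "the"))) {
            System.out.println(w);
        }
    }
}
```

If the first line of output reports a Turkish default locale, that supports the EDIT2 explanation. One way to change it is to start the JVM with the standard system properties user.language and user.country (for example -Duser.language=en -Duser.country=US); how you pass these depends on how you launch Mallet on your machine, so treat that part as an assumption to verify.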