Although I pass both the default stopword list and an extra stopword list when I use MALLET for topic modeling, some stop words still appear in the topics, for example "ın", "ıf", and "ıt". How do I ensure that these stopwords don't appear in the topic model? The topics are below.
0 5 ı ıt time room door house people eyes thing night woman day make girl face mother voice car home
1 5 ıt ın fact sense point experience order form human action common general religious law part change number case evidence
2 5 time place work water long make cut ın square large top house side built machine building clay piece design
3 5 school people ın development national american members social program system economic group problems education class students work policy children
4 5 year york week home music american city house president day school club william show white ın days family night
5 5 ıt time fire feet river long road side miles game land run hit war gun big ball began arms
6 5 hands water white hand ın black food eyes face slowly sun cold ıt life red head hot long body
7 5 ın number system data surface temperature high low type volume information material pressure feed form small results shown method
8 5 world life church god war time great death book english ın century history england french west soviet love spirit
9 5 state year united government general business federal department court tax cost million company secretary act public ın service industry
Thanks for any advice.
Check the spelling of your stopwords. Mallet lowercases your corpus by default, but it does not lowercase your stopwords!
Also check the format of your stopword file: Mallet expects it to be one-word-per-line.
And don't forget to add the option --stoplist-file yourstopwordfile.txt to the command mallet import-dir.
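If you want to check the stopword file automatically, here is a minimal Python sketch of the two points above (lowercase entries, one word per line). The function name normalize_stoplist is my own and not part of Mallet; it only cleans up the text of the file, which you would then save and pass to Mallet yourself.

```python
def normalize_stoplist(text):
    """Return a sorted list of unique, lowercased stopwords.

    Hypothetical helper, not part of Mallet. It accepts the raw contents
    of a stopword file and tolerates blank lines, stray whitespace, and
    several words on one line.
    """
    words = set()
    for line in text.splitlines():
        # split() with no arguments also drops empty lines and extra spaces
        for word in line.split():
            words.add(word.lower())
    return sorted(words)


# Example: a messy list with capitals, a duplicate, and two words on one line
raw = "The and\nOf\n\n  IT  \nthe\n"
cleaned = normalize_stoplist(raw)
print("\n".join(cleaned))  # one word per line, ready to write to a file
```

Writing "\n".join(cleaned) to a file gives the one-word-per-line layout described above.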
EDIT: Beware of OCR errors in your input file: I see that in the topics words like "ın" are spelled with a dotless i (as used in Turkish orthography), not with the usual dotted i. So either apply some OCR correction before topic modelling or make the misspelled ın's with dotless i additional stopwords.
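For the second option, something like the following Python sketch could generate the dotless-i spellings from an existing stopword list. This is only a heuristic and the helper name dotless_variants is made up for illustration: it produces one variant with just a leading "i" replaced and one with every "i" replaced, which may or may not cover all the misspellings in your corpus.

```python
DOTLESS_I = "\u0131"  # 'ı', LATIN SMALL LETTER DOTLESS I


def dotless_variants(stopwords):
    """Return dotless-i spellings of the given stopwords (heuristic).

    Hypothetical helper, not part of Mallet. Words without an 'i'
    produce no variant.
    """
    variants = set()
    for word in stopwords:
        if word.startswith("i"):
            # only the first letter replaced, e.g. "initial" -> "ınitial"
            variants.add(DOTLESS_I + word[1:])
        # every 'i' replaced, e.g. "within" -> "wıthın"
        variants.add(word.replace("i", DOTLESS_I))
    # drop words that came out unchanged because they contain no 'i'
    return sorted(variants - set(stopwords))


extra = dotless_variants(["in", "if", "it", "the"])
print("\n".join(extra))  # append these lines to your stopword file
```

The printed words can be appended, one per line, to the file you pass with --stoplist-file.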
EDIT2: There is another possible source for the dotless-i "ın", "ıf", "ıt": Mallet lowercases all words in the corpus. When your locale is set to Turkish, Java lowercases a capital I to a dotless i. Check your Java language settings and create the topic model again from scratch.
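To tell the two causes apart, you could count the dotless-i tokens in the raw input text before Mallet touches it. A rough Python sketch (the function name dotless_tokens is my own, and it splits on whitespace only, which is cruder than Mallet's tokenizer):

```python
from collections import Counter

DOTLESS_I = "\u0131"  # 'ı', LATIN SMALL LETTER DOTLESS I


def dotless_tokens(text):
    """Count whitespace-separated tokens in raw text that contain 'ı'.

    Hypothetical diagnostic, not part of Mallet.
    """
    return Counter(tok for tok in text.split() if DOTLESS_I in tok)


# Example with a made-up snippet of raw corpus text
sample = "In the room \u0131n fact It \u0131t \u0131n"
for token, count in dotless_tokens(sample).most_common():
    print(token, count)
```

If the raw files contain no such tokens but the topics do, the dotless i is being introduced during lowercasing, which points to the locale rather than to OCR errors.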