apache-spark, tf-idf, naivebayes

Term Frequency and IDF - Clarification


According to https://en.wikipedia.org/wiki/Tf%E2%80%93idf, IDF is used to reduce the weight of words that occur frequently across documents (like "the", "of", etc.).

If I apply stop word removal before extracting features, should IDF still be applied? I feel that Term Frequency alone would be sufficient, since the repeated unimportant words have already been filtered out.

Please advise.


Solution

  • Even if you use stop word removal, IDF will still be useful in most cases: the words that survive the filter still differ in how many documents they occur in.

    I personally try to avoid stop word removal: it is language-dependent, the contents of the list are arbitrary, and you may remove useful words. Stop word removal is like applying IDF with a hard cutoff: every word above the cutoff is kept as good, every word below it is discarded as useless, and there is no "in between" zone. That cannot reflect the real nature of language.

    But the best way to answer your question is to experiment with both approaches: if you use TF-IDF for text classification or information retrieval, why not test with and without IDF and see which one yields the better accuracy?
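To make the point about IDF after stop word removal concrete, here is a minimal sketch in plain Python. The toy corpus, the stop word list, and the helper names (`tokenize`, `tfidf`) are all made up for illustration. The IDF uses the smoothed form log((N + 1) / (df + 1)), which is the formula the Spark MLlib documentation gives; a real pipeline would use the library's own feature transformers rather than this hand-rolled version.

```python
import math
from collections import Counter

# Hypothetical toy corpus and stop word list, for illustration only.
docs = [
    "the movie was great and the plot was great",
    "the movie was dull and the plot was thin",
    "the movie had a great cast",
]
stop_words = {"the", "was", "and", "a", "had"}


def tokenize(text):
    """Split on whitespace and drop stop words."""
    return [w for w in text.split() if w not in stop_words]


tokenized = [tokenize(d) for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter()
for tokens in tokenized:
    df.update(set(tokens))

# Smoothed IDF: log((N + 1) / (df + 1)).
idf = {term: math.log((n_docs + 1) / (df[term] + 1)) for term in df}


def tfidf(tokens):
    """Raw term count multiplied by IDF, per term."""
    return {term: count * idf[term] for term, count in Counter(tokens).items()}


# Print the surviving terms from least to most distinctive.
for term in sorted(idf, key=idf.get):
    print(f"{term:6s} df={df[term]} idf={idf[term]:.3f}")

print(tfidf(tokenized[0]))
```

In this toy corpus "movie" is not on the stop word list, yet it appears in every document, so its IDF is zero and it contributes nothing to the TF-IDF vector. Rarer words such as "dull" or "cast" get a higher weight than "great" or "plot". Plain term frequency would treat all of these words the same, which is the gradation a stop word list alone cannot provide.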