text-classification unsupervised-learning

Split text files into two groups - unsupervised learning


Imagine you are a librarian and, over time, you have classified a batch of text files (approx. 100) under a single, ambiguous keyword.

Each text file is actually about either keyword_meaning1 or keyword_meaning2.

Which unsupervised learning approach would you use to split the text files into two groups?

What precision (as a percentage of correctly classified files) can be achieved for a given number of text files?

Alternatively, can the method flag certain files in a group for the librarian to check, because they may have been classified incorrectly?


Solution

  • The easiest starting point would be a naive Bayes classifier. Note that it is a supervised method, so you would first need to label a few files for each meaning. It's hard to speculate about the expected precision; you have to test it yourself. Just get a program for e-mail spam detection and try it out. For example, SpamBayes (http://spambayes.sourceforge.net/) is quite a good starting point and easily hackable. SpamBayes has a nice feature: it labels messages as "unsure" when there is no clear separation between the two classes.

    Edit: If you really want an unsupervised clustering method, then perhaps something like Carrot2 (http://project.carrot2.org/) is more appropriate.
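
To make the unsupervised route concrete, here is a minimal sketch of a two-way split with an "unsure" flag, using only the Python standard library. It is not what Carrot2 or SpamBayes do internally; it swaps in a plain spherical 2-means over TF-IDF vectors. The function names (`tfidf_vectors`, `split_in_two`), the seeding rule and the `margin` value of 0.05 are all assumptions made for illustration, and the threshold would need tuning on real files.

```python
import math
import re
from collections import Counter


def tfidf_vectors(docs):
    """Turn raw texts into L2-normalised TF-IDF dicts (term -> weight)."""
    tokenised = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    n = len(docs)
    # Document frequency: in how many files each term occurs.
    df = Counter(term for toks in tokenised for term in set(toks))
    vectors = []
    for toks in tokenised:
        tf = Counter(toks)
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
               for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a, b):
    """Dot product of two unit-length sparse vectors (= cosine similarity)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def centroid(vectors):
    """Unit-length mean direction of a group of sparse vectors."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
    return {t: w / norm for t, w in total.items()}


def split_in_two(docs, margin=0.05, iterations=20):
    """Return (labels, unsure): a 0/1 group per file and a review flag."""
    vecs = tfidf_vectors(docs)
    # Assumed seeding rule: the first file and the file least similar to it.
    far = min(range(len(vecs)), key=lambda i: cosine(vecs[0], vecs[i]))
    centres = [vecs[0], vecs[far]]
    labels = None
    for _ in range(iterations):
        new_labels = [0 if cosine(v, centres[0]) >= cosine(v, centres[1]) else 1
                      for v in vecs]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for k in (0, 1):
            members = [v for v, l in zip(vecs, labels) if l == k]
            if members:
                centres[k] = centroid(members)
    # Final pass: label each file and flag those nearly equidistant
    # from both centres so that a human can check them.
    sims = [(cosine(v, centres[0]), cosine(v, centres[1])) for v in vecs]
    labels = [0 if s0 >= s1 else 1 for s0, s1 in sims]
    unsure = [abs(s0 - s1) < margin for s0, s1 in sims]
    return labels, unsure


if __name__ == "__main__":
    # Toy files for the ambiguous keyword "jaguar" (made-up data).
    files = [
        "jaguar cat jungle prey hunting forest",
        "jaguar car engine speed luxury sedan",
        "jaguar jungle cat spotted fur prey",
        "jaguar engine car dealer luxury speed",
    ]
    for text, label, flag in zip(files, *split_in_two(files)):
        print(label, "CHECK" if flag else "ok", text)
```

The files marked as unsure are the ones a librarian would review by hand. As for precision, a sketch like this cannot predict it: the only reliable way is to label a sample of the files manually and compare.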