machine-learning nltk svm naivebayes nltk-trainer

Word Classification using Machine Learning Algorithm


I am new to machine learning. What I currently want is to classify whether some words belong to a category or not.

To be more specific: given some input words, I need to check whether they belong to the language Malayalam.

Example: enthayi ninakk sugamanno?

These are Malayalam words written in English (Latin) letters. Given input like this, the program needs to check the trained data, and if any of the input words falls under the category 'Malayalam', it needs to report that the input is Malayalam.

What I've tried:

I tried to classify it with a NaiveBayesClassifier, but it always shows a positive response for all the input data.

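# Assumed import: the constructor call below matches TextBlob's classifier API
from textblob.classifiers import NaiveBayesClassifier
train = [
    ('aliya', 'Malayalam'),
]
cl = NaiveBayesClassifier(train)
print(cl.classify('enthayi ninakk sugamanno'))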

But the print statement always outputs 'Malayalam'.


Solution

  • You need both positive and negative data to train a classifier. If the training data contains only the label 'Malayalam', that is the only answer the classifier can ever give. It wouldn't be hard to add a bunch of English text, or whatever the likely alternatives are in your domain. But you need to read up on how an nltk classifier actually works, or you'll only be able to handle words that you've seen in your training data: you need to select and extract "features" that the classifier will use to do its job.

    So (from the comments) you want to categorize individual words as being Malayalam or not. If your "features" are whole words, you are wasting your time with a classifier; just make a Python set() of Malayalam words, and check if your inputs are in it. To go the classifier route, you'll have to figure out what makes a word "look" Malayalam to you (endings? length? syllable structure?) and manually turn these properties into features so that the classifier can decide how important they are.

    A better approach for language detection is to use letter trigrams: every language has a different "profile" of common and uncommon trigrams. You can search for an existing implementation, or code your own. I had good results with "cosine similarity" as a measure of distance between the sample text and the reference data. A related question shows how to calculate cosine similarity, but for unigram counts; for language identification, apply the same calculation to trigram counts.

    Two benefits of the trigram approach: you are not dependent on familiar words or on coming up with clever features, and you can apply it to stretches of text longer than a single word (even after filtering out English), which will give you more reliable results. NLTK's langid corpus provides trigram counts for hundreds of common languages, but it's also easy enough to compile your own statistics. (See also nltk.util.trigrams().)
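
To make the first two points concrete (training on both labels, and using features rather than whole words), here is a minimal sketch using only the standard library. It is a hand-rolled naive Bayes stand-in, not nltk's classifier. The class `TinyNaiveBayes`, the function `word_features`, the choice of character bigrams plus a three-letter suffix, and the English words in `TOY_TRAIN` are all assumptions made for illustration; the Malayalam words are the ones from the question.

```python
import math
from collections import Counter, defaultdict

def word_features(word):
    """Hypothetical feature set: character bigrams plus the last three letters."""
    w = word.lower()
    padded = f"^{w}$"  # mark word start and end
    feats = {f"bi={padded[i:i + 2]}" for i in range(len(padded) - 1)}
    feats.add(f"suffix={w[-3:]}")
    return feats

class TinyNaiveBayes:
    """Illustrative naive Bayes with add-one smoothing (not nltk's implementation)."""

    def __init__(self, labeled_words):
        self.label_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for word, label in labeled_words:
            self.label_counts[label] += 1
            for feat in word_features(word):
                self.feature_counts[label][feat] += 1
                self.vocab.add(feat)

    def classify(self, word):
        total = sum(self.label_counts.values())
        scores = {}
        for label, n in self.label_counts.items():
            score = math.log(n / total)  # log prior
            denom = sum(self.feature_counts[label].values()) + len(self.vocab)
            for feat in word_features(word):
                if feat in self.vocab:  # ignore features never seen in training
                    score += math.log((self.feature_counts[label][feat] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)

# Toy data: far too small for real use, but it contains BOTH labels.
TOY_TRAIN = [
    ("aliya", "Malayalam"), ("enthayi", "Malayalam"),
    ("ninakk", "Malayalam"), ("sugamanno", "Malayalam"),
    ("hello", "English"), ("where", "English"),
    ("thanks", "English"), ("friend", "English"),
]

clf = TinyNaiveBayes(TOY_TRAIN)
print(clf.classify("hello"))
print(clf.classify("ninakk"))

# With a single label in the training data, every input gets that label,
# which is the behaviour described in the question.
one_label = TinyNaiveBayes([("aliya", "Malayalam")])
print(one_label.classify("hello"))
```

With training data this small, unseen words are classified more or less arbitrarily; the sketch only shows the mechanics of supplying two labels and extracting sub-word features.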
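
The letter-trigram approach with cosine similarity can be sketched in a few lines of standard-library Python. The function names (`trigram_profile`, `cosine_similarity`, `guess_language`) and the tiny reference texts in `REFERENCES` are assumptions made for illustration; a usable system would build its profiles from much larger samples of each language.

```python
import math
from collections import Counter

def trigram_profile(text):
    """Count overlapping letter trigrams, padding each word with a space on both sides."""
    counts = Counter()
    for word in text.lower().split():
        padded = f" {word} "
        counts.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return counts

def cosine_similarity(a, b):
    """Cosine of the angle between two trigram-count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def guess_language(sample, references):
    """Return the language whose reference profile is most similar to the sample."""
    profile = trigram_profile(sample)
    return max(references, key=lambda lang: cosine_similarity(profile, references[lang]))

# Toy reference profiles; real ones need far more text per language.
REFERENCES = {
    "Malayalam": trigram_profile("aliya enthayi ninakk sugamanno"),
    "English": trigram_profile("how are you my friend what is the news"),
}

print(guess_language("ninakk sugamanno", REFERENCES))
print(guess_language("what is the news", REFERENCES))
```

Because the score is computed over the whole sample, the same functions work for a single word or for a full sentence, and longer samples give the profile more trigrams to match against.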