Search code examples
pythonmachine-learningnlptext-classification

How to train a classifier to detect vernacular from grammatical language?


I'm using text classification to classify Arabic dialects, so far I have 4 dialects. However, now I want the classifier to detect the formal(standard or grammatical) language of those dialects which is called MSA(Modern Standard Arabic).

Should I use grammatical analysis? build a language model? or I do the same as I did with the dialects by collecting MSA tweets and then train them?


Solution

  • You can train a language model for each dialects of the language. Then, given a sentence find the (log) probability returned by each language model and assign it to the language model which returns the high score.

    p* = argmax p_i p_i(sentence)
    

    where p_i is the language model of the dialects i.

    Language model is a probability distribution over sequences of words. Given a sentence, say of length m, it assigns a probability P(w1, ... ,wm) to the whole sequence. So the sentence will belong to the dialect whose P_i(w) is high, where P_i is the language model of dialect i.