I'm using text classification to classify Arabic dialects; so far I have 4 dialects. However, I now want the classifier to also detect the formal (standard, or grammatical) form of the language, which is called MSA (Modern Standard Arabic).
Should I use grammatical analysis, build a language model, or do the same as I did with the dialects: collect MSA tweets and train on them?
You can train a language model for each dialect of the language. Then, given a sentence, compute the (log) probability returned by each language model and assign the sentence to the dialect whose model returns the highest score:
$$p^* = \arg\max_{p_i} \; p_i(\text{sentence})$$
where $p_i$ is the language model of dialect $i$.
A language model is a probability distribution over sequences of words. Given a sentence, say of length $m$, it assigns a probability $P(w_1, \dots, w_m)$ to the whole sequence. So the sentence is assigned to the dialect whose $p_i(w_1, \dots, w_m)$ is highest, where $p_i$ is the language model of dialect $i$.
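As a rough illustration of this idea, here is a minimal Python sketch. It makes several assumptions that are not part of the answer above: each model is a word-level bigram model with add-one smoothing, MSA is treated as just one more class with its own model, and all models share one vocabulary size so their scores are comparable. The class `BigramLM`, the function `classify`, the labels, and the placeholder tokens (`m1`, `a1`, ...) are hypothetical stand-ins for real tokenized tweets.

```python
import math
from collections import Counter


class BigramLM:
    """Word-level bigram language model with add-one (Laplace) smoothing."""

    def __init__(self, sentences, vocab_size):
        self.vocab_size = vocab_size
        self.context_counts = Counter()  # how often each word appears as a context
        self.bigram_counts = Counter()   # how often each (previous, current) pair appears
        for sentence in sentences:
            tokens = ["<s>"] + sentence.split() + ["</s>"]
            self.context_counts.update(tokens[:-1])
            self.bigram_counts.update(zip(tokens, tokens[1:]))

    def log_prob(self, sentence):
        """Return log P(w1, ..., wm) under this model."""
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        total = 0.0
        for prev, word in zip(tokens, tokens[1:]):
            # Unseen pairs have count 0 and still get smoothed probability mass.
            numerator = self.bigram_counts[(prev, word)] + 1
            denominator = self.context_counts[prev] + self.vocab_size
            total += math.log(numerator / denominator)
        return total


def classify(sentence, models):
    """Return the label whose language model scores the sentence highest."""
    return max(models, key=lambda label: models[label].log_prob(sentence))


# Toy corpora with placeholder tokens (assumption: real data would be tokenized tweets).
corpora = {
    "MSA": ["m1 m2 m3", "m2 m3 m4"],
    "dialect_A": ["a1 a2 a3", "a2 a3 a4"],
}

# Shared vocabulary size: every word type in any corpus, plus the end marker.
all_words = {w for sents in corpora.values() for s in sents for w in s.split()}
vocab_size = len(all_words) + 1

models = {label: BigramLM(sents, vocab_size) for label, sents in corpora.items()}

print(classify("m2 m3", models))
print(classify("a2 a3", models))
```

With real data, a character n-gram model or a stronger smoothing method may work better than add-one smoothing, and longer sentences get lower log probabilities under every model, which does not affect the argmax because all models score the same sentence.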