python machine-learning scikit-learn svm naivebayes

What representation of chat text data should I use for user classification?

I'm trying to train a classifier on text from a chat between two users, so that I can later predict which of the two is more likely to say a given sentence or word. To get there, I mined the text from the chat log and ended up with two arrays of words, UserA_words and UserB_words.

Into which format do I have to transform these arrays to pass them to a classifier like Naive Bayes or SVM? How do I pass, e.g., a bag-of-words representation to a classifier?

Solution

You're asking what ML representation you should use for user-classification of chat text.

Bag-of-words and word vectors are the main representations generally used in text processing. However, user classification of chat is not the usual text-processing task: we look for telltale features indicative of a specific user. Here are some:

character length, word length, sentence length of each comment
typing speed (esp. if you have timestamps in seconds)
ratio of punctuation (e.g. 17 punctuation symbols in 80 chars = 17/80)
ratio of capitalization
ratio of numerals
ratio of whitespace
character n-grams (and notice these can pick up e.g. l0ser, f##k, :-) )
use of Unicode (emojis, symbols e.g. stars)
ratio of specific punctuation (e.g. how many '.', '!', '?', '*', '#' )
word-counts, esp. anything statistically anomalous
anything else you can think of that seems predictive for these two users, e.g. number of misspelled words per sentence (may be actual typos, or come from predictive swiping on a cellphone)
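
As a rough illustration of how several of the features above could be turned into numbers, here is a minimal sketch using only the standard library. It assumes you still have the raw messages per user (not just the word arrays), since punctuation, capitalization and whitespace are lost once the text is split into words. The names style_features and char_ngrams, the exact set of features, and the sample messages are all made up for this example.

```python
import string


def style_features(message):
    """Return simple per-message style features (an illustrative set, not exhaustive)."""
    n_chars = len(message)
    words = message.split()
    # Guard against empty messages so the ratios never divide by zero.
    denom = max(n_chars, 1)
    return {
        "char_length": n_chars,
        "word_count": len(words),
        "mean_word_length": sum(len(w) for w in words) / max(len(words), 1),
        "punctuation_ratio": sum(c in string.punctuation for c in message) / denom,
        "capital_ratio": sum(c.isupper() for c in message) / denom,
        "digit_ratio": sum(c.isdigit() for c in message) / denom,
        "whitespace_ratio": sum(c.isspace() for c in message) / denom,
        # Crude proxy for "use of Unicode": anything outside the ASCII range.
        "non_ascii_ratio": sum(ord(c) > 127 for c in message) / denom,
        "exclamation_count": message.count("!"),
        "question_count": message.count("?"),
    }


def char_ngrams(message, n=3):
    """Return all overlapping character n-grams of a message."""
    return [message[i:i + n] for i in range(len(message) - n + 1)]


# Hypothetical labelled messages: (user, text)
messages = [
    ("A", "OMG!!! That is SO cool :-)"),
    ("B", "ok. see you at 5"),
]

# One row of numbers per message, one label per message.
X = [list(style_features(text).values()) for _, text in messages]
y = [user for user, _ in messages]
```

X is then a list of equal-length numeric rows and y the matching list of labels, which is the shape that scikit-learn classifiers accept in fit(X, y). For the character n-gram counts, scikit-learn's CountVectorizer(analyzer="char", ngram_range=(2, 4)) builds the count matrix for you, so a helper like char_ngrams is only needed if you want to inspect the n-grams yourself.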