python machine-learning scikit-learn svm naivebayes

What representation of chat text data should I use for user classification?


I'm trying to train a classifier on text from a chat between two users, so that later I can predict which of the two is more likely to say a given sentence or word. To get there, I mined the text from the chat log and ended up with two arrays of words, UserA_words and UserB_words.

Into which format do I have to transform these arrays to pass them to a classifier like naive Bayes or an SVM? How do I pass, e.g., a bag-of-words representation to a classifier?


Solution

  • You're asking what ML representation you should use for user-classification of chat text.

    Bag-of-words and word vectors are the main representations generally used in text processing. However, user classification of chat is not the usual text-processing task: instead of topic or meaning, we look for telltale features indicative of a specific user. Here are some:

    • character length, word length, sentence length of each comment
    • typing speed (esp. if you have timestamps in seconds)
    • ratio of punctuation (e.g. 17 punctuation symbols in 80 chars = 17/80)
    • ratio of capitalization
    • ratio of numerals
    • ratio of whitespace
    • character n-grams (and notice these can pick up e.g. l0ser, f##k, :-) )
    • use of Unicode (emojis, symbols e.g. stars)
    • ratio of specific punctuation (e.g. how many '.', '!', '?', '*', '#' )
    • word-counts, esp. anything statistically anomalous
    • anything else you can think of that seems predictive for these two users, e.g. number of misspelled words per sentence (may be actual typos, or come from predictive swiping on a cellphone)
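
As a rough illustration of how some of the features listed above could be turned into numbers, here is a minimal sketch using only the Python standard library. The function names `style_features` and `char_ngrams`, the particular features chosen, and the toy messages are all assumptions made for this example; they are not taken from the question's data.

```python
import string
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in one message."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def style_features(message):
    """Map one chat message to a fixed-order list of numeric features.

    The selection mirrors a few bullets from the list above and is
    illustrative only (hypothetical helper, not a library function).
    """
    n_chars = len(message)
    denom = max(n_chars, 1)  # avoid division by zero on empty messages
    words = message.split()
    return [
        n_chars,                                                # character length
        len(words),                                             # word count
        sum(c in string.punctuation for c in message) / denom,  # punctuation ratio
        sum(c.isupper() for c in message) / denom,              # capitalization ratio
        sum(c.isdigit() for c in message) / denom,              # numeral ratio
        sum(c.isspace() for c in message) / denom,              # whitespace ratio
        sum(ord(c) > 127 for c in message) / denom,             # non-ASCII ratio
        message.count("!") / denom,                             # '!' ratio
        message.count("?") / denom,                             # '?' ratio
    ]


# Hypothetical toy data: one row per message, one label per row.
messages = ["OMG!!! no way :-)", "I think that is correct."]
labels = ["A", "B"]
X = [style_features(m) for m in messages]
```

Each row of `X` is one message and each column one feature, which is the general shape scikit-learn estimators expect: `X` and `labels` can be passed to a classifier's `fit(X, y)`. For the bag-of-words or character n-gram part, scikit-learn's `CountVectorizer` (for example with `analyzer='char'` and an `ngram_range`) builds the count matrix directly, so a hand-rolled counter like `char_ngrams` is mainly useful for inspecting which n-grams each user favors.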