Tags: python, machine-learning, nlp, word2vec, text-classification

Build a multiclass text classifier which takes vectors generated from word2vec as independent variables to predict a class


I am dealing with patient data. I want to predict the top N diseases given a set of symptoms.

This is a sample of my dataset; in total I have around 1200 unique symptoms and around 200 unique diagnoses:

     ID            Symptom combination                Diagnosis
     Patient1:     fever, loss of appetite, cold      Flu
     Patient2:     hair loss, blood pressure          Thyroid
     Patient3:     hair loss, blood pressure          Flu
     Patient4:     throat pain, joint pain            Viral Fever
     ...
     Patient30000: vomiting, nausea                   Diarrhoea

What I am planning to do with this dataset is to use the Symptoms column to generate word vectors with Word2vec for each row of patient data. After generating the vectors, I want to build a classifier, with the vectors in each row being my independent variables and the Diagnosis being the target categorical variable.

Should I take the average of the word2vec vectors in each row to generate the feature vector? If so, any guidance on how best to do that?


Solution

  • You can average a bunch of word-vectors for symptoms together to get a single feature-vector of the same dimensionality. (If your word-vectors are 100d each, averaging them together gets a single 100d summary vector.)

    But such averaging is fairly crude, and risks diluting the contribution of each individual symptom.

    (As a simplified, stylized example, imagine a nurse took a patient's temperature at 9pm, and found it to be 102.6°F. Then again, at 7am, and found it to be 94.6°F. A doctor asks, "how's our patient's temperature?", and the nurse says the average, "98.6°F". "Wow," says the doctor, "it's rare for someone to be so on-the-dot for the normal healthy temperature. Next patient!" Averaging hid the important information: that the patient had both a fever and dangerous hypothermia.)
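    (If you do still want to try the averaging approach, here's a rough sketch of what it can look like with gensim. The tiny symptom lists, model settings, and helper function are purely illustrative, not a recommendation; in practice you'd train on your full 30,000 rows.)

        import numpy as np
        from gensim.models import Word2Vec

        # each patient's symptoms as a list of tokens (illustrative data only)
        symptom_lists = [
            ["fever", "loss_of_appetite", "cold"],
            ["hair_loss", "blood_pressure"],
            ["vomiting", "nausea"],
        ]

        # train a small Word2Vec model on the symptom "sentences"
        # (gensim 4.x parameter names; older versions use size/iter instead)
        model = Word2Vec(sentences=symptom_lists, vector_size=100, min_count=1, epochs=50)

        def average_vector(symptoms, wv, dim=100):
            """Average the word-vectors of all in-vocabulary symptoms into one feature vector."""
            vecs = [wv[s] for s in symptoms if s in wv]
            return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

        X = np.vstack([average_vector(s, model.wv) for s in symptom_lists])
        print(X.shape)  # (n_patients, 100)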

    It sounds like you have a controlled-vocabulary of symptoms, with just some known, capped, and not-very-large number of symptom tokens: about 1200.

    In such a case, turning those into a categorical vector for the presence/absence of each symptom may work far better than word2vec-based approaches. Maybe you have 100 different symptoms or 10,000 different symptoms. Either way, you can turn them into a large vector of 1s and 0s representing each possible symptom in order, and lots of classifiers will do pretty well with that input.

    If treating the list-of-symptoms like a text-of-words, a simple "bag of words" representation of the text will essentially be this categorical representation: a 1200-dimensional 'one-hot' vector.
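    A quick sketch of that presence/absence encoding, using scikit-learn's MultiLabelBinarizer on per-patient symptom lists (a CountVectorizer over the raw symptom text would give essentially the same bag-of-words matrix); the toy data here is just for illustration:

        from sklearn.preprocessing import MultiLabelBinarizer

        # each patient's symptoms as a list of tokens (illustrative data only)
        symptom_lists = [
            ["fever", "loss of appetite", "cold"],
            ["hair loss", "blood pressure"],
            ["throat pain", "joint pain"],
        ]
        diagnoses = ["Flu", "Thyroid", "Viral Fever"]

        # one column per unique symptom: 1 if present, 0 if absent
        mlb = MultiLabelBinarizer()
        X = mlb.fit_transform(symptom_lists)   # shape: (n_patients, n_unique_symptoms)
        print(mlb.classes_)                    # the ~1200-symptom vocabulary, in column order
        print(X)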

    And unless this is some academic exercise where you've been required to use word2vec, it's not a good place to start, and may not be a part of the best solution. To train good word-vectors, you need more data than you have. (To re-use word-vectors from elsewhere, they should be well-matched to your domain.)

    Word-vectors are most likely to help if you've got tens-of-thousands to hundreds-of-thousands of terms, and many contextual examples of each of their uses, to plot their subtle variations-of-meaning in a dense shared space. Only 30,000 'texts', of ~3-5 tokens each, and only ~1200 unique tokens, is fairly small for word2vec.

    (I made similar points in my comments on one of your earlier questions.)

    Once you've turned each row into a feature vector – whether it's by averaging symptom word-vectors, or probably better creating a bag-of-words representation – you can and should try many different classifiers to see which works best.

    Many are drop-in replacements for each other, and with the size of your data, testing many against each other in a loop may take less than an hour or a few hours.
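    For instance, a rough sketch of such a loop, comparing a handful of common scikit-learn classifiers by cross-validated accuracy. The random stand-in data and the particular models/settings are just placeholders for your real feature matrix and whatever estimators you want to compare; the last few lines sketch how a classifier with predict_proba could give you the "top N diseases" per patient that you asked about.

        import numpy as np
        from sklearn.model_selection import cross_val_score
        from sklearn.linear_model import LogisticRegression
        from sklearn.naive_bayes import MultinomialNB
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.svm import LinearSVC

        # stand-in data: replace with your real one-hot symptom matrix and diagnosis labels
        rng = np.random.default_rng(0)
        X = rng.integers(0, 2, size=(500, 1200))
        y = rng.choice(["Flu", "Thyroid", "Viral Fever", "Diarrhoea"], size=500)

        candidates = {
            "logistic_regression": LogisticRegression(max_iter=1000),
            "naive_bayes": MultinomialNB(),
            "random_forest": RandomForestClassifier(n_estimators=200),
            "linear_svm": LinearSVC(),
        }

        # compare candidates with 5-fold cross-validation
        for name, clf in candidates.items():
            scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
            print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")

        # for "top N diseases" per patient, prefer a classifier with predict_proba
        best = LogisticRegression(max_iter=1000).fit(X, y)
        proba = best.predict_proba(X[:1])[0]
        top_n = sorted(zip(best.classes_, proba), key=lambda p: -p[1])[:3]
        print(top_n)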

    If you're at a total loss for where to start, anything listed in the 'classifiers' upper-left area of this scikit-learn graphical guide is worth trying:

    scikit-learn "choosing the right estimator"

    If you want to consider an even wider range of possibilities, and get a vaguely-intuitive idea of which ones can best discover certain kinds of "shapes" in the underlying high-dimensional data, you can look at all those demonstrated in this scikit-learn "classifier comparison" page, with these graphical representations of how well they handle a noisy 2d classification challenge (instead of your 1200d challenge):

    [image: scikit-learn classifier comparison – decision boundaries of various classifiers on noisy 2d toy datasets]