I am dealing with patient data. I want to predict the top N diseases given a set of symptoms.
This is a sample of my dataset. In total I have around 1200 unique Symptoms and around 200 unique Diagnoses.
ID            Symptom combination              Diagnosis
Patient1      fever, loss of appetite, cold    Flu
Patient2      hair loss, blood pressure        Thyroid
Patient3      hair loss, blood pressure        Flu
Patient4      throat pain, joint pain          Viral Fever
...
Patient30000  vomiting, nausea                 Diarrhoea
What I am planning to do with this dataset is to use the Symptoms column to generate word vectors using Word2vec for each row of patient data. After generating the vectors I want to build a classifier, with the vectors in each row being my independent variable and the Diagnosis being the target categorical variable.
Should I take the average of the word2vec vectors of each row's symptoms to build the feature vector? If so, is there anything I should keep in mind?
You can average a bunch of word-vectors for symptoms together to get a single feature-vector of the same dimensionality. (If your word-vectors are 100d each, averaging them together gets a single 100d summary vector.)
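For example, here's a minimal sketch of that averaging, assuming a gensim Word2Vec model trained on your tokenized symptom lists (the toy rows and model settings below are purely illustrative):

    import numpy as np
    from gensim.models import Word2Vec

    # Toy stand-in for your ~30,000 rows of tokenized symptom lists
    symptom_texts = [["fever", "loss of appetite", "cold"],
                     ["hair loss", "blood pressure"]]

    # Illustrative settings only; real training would use your full corpus
    model = Word2Vec(sentences=symptom_texts, vector_size=100, min_count=1)

    def average_vector(tokens, model):
        # Mean of the word-vectors for tokens the model knows; zeros if none are known
        known = [model.wv[t] for t in tokens if t in model.wv]
        if not known:
            return np.zeros(model.vector_size)
        return np.mean(known, axis=0)

    # One 100d feature vector per patient row
    X = np.vstack([average_vector(tokens, model) for tokens in symptom_texts])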
But such averaging is fairly crude, and risks diluting the information each individual symptom carries.
(As a simplified, stylized example, imagine a nurse took a patient's temperature at 9pm, and found it to be 102.6°F. Then again, at 7am, and found it to be 94.6°F. A doctor asks, "how's our patient's temperature?", and the nurse says the average, "98.6°F". "Wow," says the doctor, "it's rare for someone to be so on-the-dot for the normal healthy temperature. Next patient!" Averaging hid the important information: that the patient had both a fever and dangerous hypothermia.)
It sounds like you have a controlled-vocabulary of symptoms, with just some known, capped, and not-very-large number of symptom tokens: about 1200.
In such a case, turning those into a categorical vector for the presence/absence of each symptom may work far better than word2vec-based approaches. Maybe you have 100 different symptoms or 10,000 different symptoms. Either way, you can turn them into a large vector of 1s and 0s representing each possible symptom in order, and lots of classifiers will do pretty well with that input.
If treating the list-of-symptoms like a text-of-words, a simple "bag of words" representation of the text is essentially this categorical representation: a 1200-dimensional binary vector marking which symptoms are present (a 'multi-hot' encoding, with one slot per possible symptom).
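As a rough sketch of that presence/absence encoding, scikit-learn's MultiLabelBinarizer does exactly this over lists of symptom tokens (the toy rows below just echo your sample):

    from sklearn.preprocessing import MultiLabelBinarizer

    # Toy stand-in for your rows: each row is the list of symptoms present
    symptom_texts = [["fever", "loss of appetite", "cold"],
                     ["hair loss", "blood pressure"],
                     ["throat pain", "joint pain"]]
    diagnoses = ["Flu", "Thyroid", "Viral Fever"]    # target labels, one per row

    mlb = MultiLabelBinarizer()
    X = mlb.fit_transform(symptom_texts)   # shape (n_rows, n_unique_symptoms), entries 0/1
    print(mlb.classes_)                    # column order of the symptom vocabulary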
And unless this is some academic exercise where you've been required to use word2vec, it's not a good place to start, and may not be part of the best solution. To train good word-vectors, you need more data than you have. (To re-use word-vectors from elsewhere, they should be well-matched to your domain.)
Word-vectors are most likely to help if you've got tens-of-thousands to hundreds-of-thousands of terms, and many contextual examples of each of their uses, to plot their subtle variations-of-meaning in a dense shared space. Only 30,000 'texts', of ~3-5 tokens each, and only ~1200 unique tokens, is fairly small for word2vec.
(I made similar points in my comments on one of your earlier questions.)
Once you've turned each row into a feature vector (whether by averaging symptom word-vectors or, probably better, by creating a bag-of-words representation), you can and should try many different classifiers to see which works best.
Many are drop-in replacements for each other, and at the size of your data, testing many against each other in a loop might take no more than an hour or a few.
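For instance, here's a sketch of such a loop, assuming X is your feature matrix (e.g. the 0/1 symptom matrix above) and y is the Diagnosis column; the particular models and settings are just reasonable defaults to start from:

    from sklearn.model_selection import cross_val_score
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC
    from sklearn.ensemble import RandomForestClassifier

    candidates = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "multinomial_nb": MultinomialNB(),               # suits non-negative binary/count features
        "linear_svc": LinearSVC(),
        "random_forest": RandomForestClassifier(n_estimators=200),
    }

    # 5-fold cross-validation of each candidate on the same data
    for name, clf in candidates.items():
        scores = cross_val_score(clf, X, y, cv=5)
        print(f"{name}: mean accuracy {scores.mean():.3f}")

And since you want the top N diseases rather than a single prediction, favor models that expose predict_proba (or decision_function), so you can rank all ~200 diagnoses by score for each patient.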
If at a total loss for where to start, anything listed in the 'classifiers' upper-left area of this scikit-learn graphical guide is worth trying.
If you want to consider an even wider range of possibilities, and get a vaguely-intuitive idea of which ones can best discover certain kinds of "shapes" in the underlying high-dimensional data, you can look at all those demonstrated on this scikit-learn "classifier comparison" page, with its graphical representations of how well each handles a noisy 2d classification challenge (instead of your 1200d challenge).