I am trying to apply word2vec to a problem. I will briefly explain my problem statement:
I am dealing with clinical data. I want to predict the top N diseases given a set of symptoms.
Patient1: ['fever', 'loss of appetite', 'cold', '#flu']
Patient2: ['hair loss', 'blood pressure', '#thyroid']
Patient3: ['hair loss', 'blood pressure', '#flu']
..
..
Patient30000: ['vomiting', 'nausea', '#diarrhoea']
Note: words with a # prefix are diagnoses; the rest are symptoms.
Applying word2vec to this corpus, I am able to generate the top 10 diagnoses given a set of input symptoms. Now I want to understand how that output is generated. I know it is cosine similarity against the sum of the input vectors, but I am unable to validate this output, or to see how to improve it. I really want to understand what exactly is going on in the background that leads to this output.
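To make this concrete, here is a minimal sketch of my setup using gensim (the corpus, hyperparameters, and the '#' filtering step below are placeholders, not my exact code):

```python
from gensim.models import Word2Vec

# each patient's record is one "sentence" of symptom/diagnosis tokens
corpus = [
    ['fever', 'loss of appetite', 'cold', '#flu'],
    ['hair loss', 'blood pressure', '#thyroid'],
    ['hair loss', 'blood pressure', '#flu'],
    # ... ~30000 rows
]

model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1)

# most_similar normalizes and averages the input vectors, then ranks the
# whole vocabulary by cosine similarity to that average
candidates = model.wv.most_similar(positive=['fever', 'cold'], topn=50)

# keep only the diagnosis tokens (the ones prefixed with '#')
top10 = [(w, s) for w, s in candidates if w.startswith('#')][:10]
```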
Can anyone help me answer these questions, or highlight the drawbacks/advantages of this approach?
Word2vec will give you n-dimensional vectors that represent each token (symptoms and diagnoses alike) based on its co-occurrence with other tokens. In particular, each symptom is represented as a vector.
One row:

```python
X = ['fever', 'loss of appetite']

X_onehot = [[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
            [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]]

X_word2vec = [[0.002, 0.25, -0.1, 0.335, 0.7264],
              [0.746, 0.6463, 0.0032, 0.6301, 0.223]]

Y = '#flu'
```
Now you can represent each row in the data by taking the average of its word2vec vectors, such as:

```python
X_avg = [[0.374, 0.44815, -0.0484, 0.48255, 0.4747]]
```
Now you have a 5-dimensional feature vector and a class label for each row in your dataset. Next, you can treat it like any other machine learning problem.
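A minimal sketch of that featurization step, assuming a trained gensim model and a list of symptom rows (`symptom_rows` and `labels` here are placeholder names):

```python
import numpy as np

def row_to_features(symptoms, wv):
    # average the word2vec vectors of the symptoms in one row,
    # skipping any token that is not in the vocabulary
    vecs = [wv[s] for s in symptoms if s in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

X = np.array([row_to_features(row, model.wv) for row in symptom_rows])
y = np.array(labels)  # the '#...' diagnosis token for each row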
If you want to predict the disease, then just train a classification model after a train-test split. That way you can validate the predictions on held-out data.
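For example, a sketch with scikit-learn (the choice of logistic regression and the top-10 evaluation are illustrative, not prescriptive):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import top_k_accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# top-N evaluation: is the true diagnosis among the N most probable classes?
proba = clf.predict_proba(X_test)
print(top_k_accuracy_score(y_test, proba, k=10, labels=clf.classes_))
```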
Using cosine similarity on the word2vec vectors directly only yields similar tokens. It will not, by itself, give you a disease recommendation model: since symptoms dominate the vocabulary, you will mostly be recommending symptoms based on other similar symptoms.