I have a dataset of job listings with about 150,000 records. I extracted skills from the descriptions using NER against a dictionary of 30,000 skills. Every skill is represented by a unique identifier.
A sample of my data:
   job_title           job_id   skills
1  business manager    4        12 13 873 4811 482 2384 48 293 48
2  java developer      55       48 2838 291 37 484 192 92 485 17 23 299 23 ...
3  data scientist      21       383 48 587 475 2394 5716 293 585 1923 494 3
Then I train a Doc2Vec model on these data, where the job titles (their ids, to be precise) are used as tags and the skill identifiers as words.
import gensim

def tagged_document(df):
    # One TaggedDocument per record: the skill ids are the words, the job_id is the single tag
    for index, row in df.iterrows():
        yield gensim.models.doc2vec.TaggedDocument(row['skills'].split(), [str(row['job_id'])])

data_for_training = list(tagged_document(data[['job_id', 'skills']]))

model_d2v = gensim.models.doc2vec.Doc2Vec(dm=0, dbow_words=1, vector_size=80, min_count=3, epochs=100, window=100000)
model_d2v.build_vocab(data_for_training)
model_d2v.train(data_for_training, total_examples=model_d2v.corpus_count, epochs=model_d2v.epochs)
It works mostly okay, but I have issues with some job titles. I tried to collect more data for them, but the behavior is still unpredictable.
For example, I have the job title "Director Of Commercial Operations", which is represented by 41 records with 11 to 96 skills each (mean 32). When I look up the most similar words for it (skills, in my case), I get the following:
# id_ is the job_id used as the document tag for this job title
docvec = model_d2v.docvecs[id_]
model_d2v.wv.most_similar(positive=[docvec], topn=5)
capacity utilization 0.5729076266288757
process optimization 0.5405482649803162
goal setting 0.5288119316101074
aeration 0.5124399662017822
supplier relationship management 0.5117508172988892
These are the top 5 skills, and 3 of them look relevant. However, the top one doesn't look valid, nor does "aeration". The problem is that none of the records for this job title contain these skills at all. It looks like noise in the output, but why does it get one of the highest similarity scores (even though the scores are generally not high)? Does it mean that the model can't pick out very specific skills for this kind of job title? Can the number of "noisy" skills be reduced? Sometimes I see much more relevant skills further down the list, but their similarity scores are often below 0.5.
One more example, this time of correct behavior with a similar amount of data: BI Analyst, 29 records, 4 to 48 skills per record (mean 21). The top skills look alright:
business intelligence 0.6986587047576904
business intelligence development 0.6861011981964111
power bi 0.6589289903640747
tableau 0.6500121355056763
qlikview (data analytics software) 0.6307920217514038
business intelligence tools 0.6143202781677246
dimensional modeling 0.6032138466835022
exploratory data analysis 0.6005223989486694
marketing analytics 0.5737696886062622
data mining 0.5734485387802124
data quality 0.5729933977127075
data visualization 0.5691111087799072
microstrategy 0.5566076636314392
business analytics 0.5535123348236084
etl 0.5516749620437622
data modeling 0.5512707233428955
data profiling 0.5495884418487549
If your gold standard for what the model should report is skills that actually appeared in the training data, are you sure you don't want a simple count-based solution? For example, just provide a ranked list of the skills that appear most often in Director Of Commercial Operations listings, as in the sketch below.
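A minimal sketch of that baseline, assuming the data DataFrame and column names from your example (top_skills_for_title is just an illustrative helper, not anything from gensim):

from collections import Counter

def top_skills_for_title(df, title, topn=5):
    # Count how often each skill id occurs across all listings for this job title
    counts = Counter()
    for skills in df.loc[df['job_title'] == title, 'skills']:
        counts.update(skills.split())
    return counts.most_common(topn)

top_skills_for_title(data, 'Director Of Commercial Operations')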
On the other hand, the essence of compressing N job titles, and 30,000 skills, into a smaller (in this case vector_size=80) coordinate-space model is to force some non-intuitive (but perhaps real) relationships to be reflected in the model.
Might there be some real pattern in the model – even if, perhaps, just some idiosyncrasies in the appearance of less-common skills – that makes aeration necessarily slot near those other skills? (Maybe it's a rare skill whose few contextual appearances co-occur with skills very near 'capacity utilization', meaning that with the tiny amount of data available, and the tiny amount of overall attention given to this skill, there's no better place for it.)
Taking note of whether your 'anomalies' are often in low-frequency skills, or lower-frequency job-ids, might enable a closer look at the data causes, or some disclaimering/filtering of most_similar() results. (The most_similar() method can limit its returned rankings to the more frequent range of the known vocabulary, for cases when the long-tail or rare words, with their rougher vectors, are intruding into the higher-quality results from better-represented words. See the restrict_vocab parameter, used in the sketch below.)
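For example, something like the following sketch. (The frequency lookup via get_vecattr() assumes gensim 4.x; in 3.x it would be model_d2v.wv.vocab[skill_id].count, and skill_id stands for whichever token you used for a suspicious skill like 'aeration'.)

docvec = model_d2v.docvecs[id_]

# Only rank against the 10,000 most-frequent skills; rarer skills with
# rougher vectors are left out of the results entirely.
model_d2v.wv.most_similar(positive=[docvec], topn=5, restrict_vocab=10000)

# Check how often a suspicious skill actually occurred in training
model_d2v.wv.get_vecattr(skill_id, 'count')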
That said, tinkering with training parameters may result in rankings that better reflect your intent. A larger min_count might remove more tokens that, lacking sufficient varied examples, mostly just inject noise into the rest of training. A different vector_size, smaller or larger, might better capture the relationships you're looking for. A more-aggressive (smaller) sample could discard more high-frequency words that might be starving more-interesting less-frequent words of a chance to influence the model.
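A variant run along those lines might look like this (the specific values are guesses to experiment with, not tuned recommendations):

model_variant = gensim.models.doc2vec.Doc2Vec(
    dm=0, dbow_words=1,
    vector_size=120,   # try smaller or larger than 80
    min_count=10,      # drop skills with too few examples to train a useful vector
    sample=1e-5,       # downsample very-frequent skills more aggressively (default is 1e-3)
    epochs=100,
    window=100000,
)
model_variant.build_vocab(data_for_training)
model_variant.train(data_for_training, total_examples=model_variant.corpus_count, epochs=model_variant.epochs)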
Note that with dbow_words=1 & a large window, and records with (perhaps?) dozens of skills each, the words are having a much larger, more neighborly effect on each other, in the model, than the tag<->word correlations. That might be good or bad.
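One way to probe that balance (just an experiment under your same setup, not a specific recommendation) is a much smaller window, so each skill trains against only a few neighboring skills per record, shifting relative weight back toward the tag<->word pairs:

model_small_window = gensim.models.doc2vec.Doc2Vec(
    dm=0, dbow_words=1, vector_size=80, min_count=3, epochs=100,
    window=5,   # each skill now trains against a few neighbors, not the whole record
)
model_small_window.build_vocab(data_for_training)
model_small_window.train(data_for_training, total_examples=model_small_window.corpus_count, epochs=model_small_window.epochs)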