python · nlp · gensim · recommendation-engine · doc2vec

Gensim's Doc2Vec with documents in multiple languages


I'm building a content-based recommender system that uses similarities between vector representations of documents. My documents are descriptions of books. Most of them are in English, but some are in other languages. I used gensim's Doc2Vec to build vector representations of the documents.

Based on my understanding of this model, documents in different languages should have very low similarity, because their words don't even overlap. But that's not what's happening: documents in different languages in my dataset can have a similarity as high as 0.9.

Why is this happening?

import pandas as pd
import numpy as np
import seaborn as sns
from pathlib import Path
from gensim.models.doc2vec import TaggedDocument, Doc2Vec
from gensim.utils import simple_preprocess
from tqdm import tqdm
from sklearn.metrics.pairwise import cosine_similarity

data_folder = Path.cwd().parent / 'data'
transformed_folder = data_folder / 'transformed'
BOOKS_PATH = Path(transformed_folder, "books_final.csv")

book_df = pd.read_csv(BOOKS_PATH)
languages = book_df.language.unique()

# build one TaggedDocument per description, tagged with its row index
training_corpus = np.empty(len(book_df), dtype=object)
for i, (isbn, desc) in tqdm(enumerate(zip(book_df['isbn'], book_df['description']))):
    training_corpus[i] = TaggedDocument(simple_preprocess(desc), [str(i)])

model = Doc2Vec(vector_size=50, min_count=2, epochs=40)
model.build_vocab(training_corpus)
# train the model
model.train(training_corpus, total_examples=model.corpus_count, epochs=model.epochs)

# get the keys of the trained model
model_keys = list(model.dv.key_to_index.keys())
matrix = model.dv.vectors
# pairwise cosine similarities between all document vectors
similarity_matrix = cosine_similarity(matrix)

# row indices of the books in each language
indices_for_language = {}
for language, group_df in book_df.groupby('language'):
    indices_for_language[language] = group_df.index.to_list()

sim_japanese_italian = similarity_matrix[indices_for_language['Japanese']][:, indices_for_language['Italian']]
# make a heatmap
sns.heatmap(sim_japanese_italian)

Heatmap of similarities between Italian and Japanese documents


Solution

  • Your corpus is a bit small compared to published results for the Doc2Vec ("Paragraph Vectors") algorithm, which is usually demonstrated on sets of tens of thousands to millions of documents.

    In particular, if you only have (4742 − 4655 =) 87 non-English description texts, and those are split over 16 non-English languages, it seems unlikely to me that any of the words from those other languages will have enough in-context usage examples to either (1) survive the min_count cutoff (which generally shouldn't be as low as 2!), or (2) acquire meaningful, generalizable representations in the model.

    That is: all those very-rare words will have fairly idiosyncratic/arbitrary meanings, and description texts made up entirely of rare, weakly-understood words will tend to get weak, almost random doc-vectors. So spurious similarities aren't much of a surprise.

    If those "non-English" description texts do have a few English-looking or omnilingual tokens in them – stray words or false cognates or punctuation or glitches – those might also dominate the description-to-description comparison.

    You should check, on your suspiciously-too-similar pairs, how many of the words in each document are actually in the trained model – keeping in mind that words absent from the trained model are essentially ignored, both during training and during post-training inference. (A 100-word document, in a language with only a few other 100-word documents, might in fact be dominated by single-appearance words that don't even meet min_count=2, so that once it is slimmed to its surviving words it becomes a short doc full of weakly-understood words.) A sketch of this check, per pair and per language, appears at the end of this answer.

    If your similarity scores between English documents are generally sensible, then the model is doing what it can where it has enough data. You really want this family of algorithms – word2vec, Doc2Vec, FastText, etc. – to have many, varied, representative examples of each word's usage in sensible contrasting contexts before it can do any meaningful modeling of those words (and of the documents that contain them). A quick way to compare English-to-English similarities against your cross-language block is also sketched at the end of this answer.

    So: if you can't manage to use a min_count of at least the usual default (5), or ideally even higher, you likely don't have sufficient data for these algorithms to show their strengths. Extending your corpus – even with other texts not of direct interest, but of sufficient variety to better represent the languages used in your texts of interest – might help. (That is: even if your universe-of-recommendations is <5000 descriptions, mixing in another 200,000 similar short blurbs from other sources, so that the words and languages of interest have enough coverage, could be worthwhile.) A retraining sketch along those lines closes this answer.
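
    A minimal sketch of the coverage check described above. It assumes the model, book_df and simple_preprocess objects from the question's code; suspect_pair is a hypothetical placeholder for the row indices of one suspiciously-similar cross-language pair.

# Sketch: how many tokens of each suspicious document survive in the trained vocabulary?
# Assumes model, book_df and simple_preprocess from the question's code.
suspect_pair = (0, 1)  # hypothetical: replace with the indices of one too-similar pair

def vocab_coverage(doc_index):
    tokens = simple_preprocess(book_df['description'].iloc[doc_index])
    known = [t for t in tokens if t in model.wv.key_to_index]
    return len(known), len(tokens)

for doc_index in suspect_pair:
    known, total = vocab_coverage(doc_index)
    print(f"doc {doc_index}: {known}/{total} tokens are in the trained word vocabulary")

# Rough per-language view of how many distinct tokens survived the min_count cutoff.
for language, group_df in book_df.groupby('language'):
    types = set()
    for desc in group_df['description']:
        types.update(simple_preprocess(desc))
    kept = sum(1 for w in types if w in model.wv.key_to_index)
    print(f"{language}: {kept}/{len(types)} distinct tokens in the model's vocabulary")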
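
    And a quick sanity check on whether the English-to-English similarities look more reasonable than the cross-language block. It assumes similarity_matrix, indices_for_language and sim_japanese_italian from the question's code, and that the English rows are literally labelled 'English' in the language column.

import numpy as np

# Sketch: compare the spread of English-English similarities against the
# Japanese-Italian block computed in the question.
eng = indices_for_language['English']  # assumption: this is the label used in book_df.language
eng_block = similarity_matrix[np.ix_(eng, eng)]
off_diag = eng_block[~np.eye(len(eng), dtype=bool)]  # drop self-similarities (always 1.0)
print("English-English mean similarity:", off_diag.mean())
print("Japanese-Italian mean similarity:", sim_japanese_italian.mean())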
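
    Finally, a sketch of the mitigation in the last paragraph: retraining with the default min_count on a corpus extended with additional blurbs. extra_descriptions is a hypothetical list of extra short texts gathered from other sources; everything else mirrors the question's setup.

from gensim.models.doc2vec import TaggedDocument, Doc2Vec
from gensim.utils import simple_preprocess

# Hypothetical: many additional short blurbs covering the languages of interest.
extra_descriptions = []

all_texts = list(book_df['description']) + extra_descriptions
extended_corpus = [TaggedDocument(simple_preprocess(text), [str(i)])
                   for i, text in enumerate(all_texts)]

# Default min_count=5 (or higher), so only reasonably-frequent words get vectors.
bigger_model = Doc2Vec(vector_size=50, min_count=5, epochs=40)
bigger_model.build_vocab(extended_corpus)
bigger_model.train(extended_corpus,
                   total_examples=bigger_model.corpus_count,
                   epochs=bigger_model.epochs)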