
How to do language representation of huge documents (3000-4000 words) for query-based retrieval?


I am trying to implement semantic search to retrieve similar documents from a dataset of unstructured French documents.

  • The documents are uncategorized templates of 300-3000 words each.
  • I am using gensim's Doc2Vec to learn paragraph embeddings with 300 dimensions and a window of 5 over the dataset.
  • I then infer a 300-dimensional vector for the search query (at most 5 words) and rank documents by cosine distance to find those closest to the query.

I am not getting good results. Please suggest some strategies for semantic search. I have also tried reducing the number of words in my dataset with RAKE keyword extraction.
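
For reference, here is a minimal sketch of the setup described above, assuming gensim 4.x. The corpus texts, the preprocessing, and every hyperparameter apart from the 300 dimensions and window of 5 are placeholders:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

docs = ["premier document ...", "deuxième document ..."]  # placeholder French corpus
corpus = [TaggedDocument(simple_preprocess(d), [i]) for i, d in enumerate(docs)]

# 300-dimensional paragraph vectors, window of 5, as in the question.
model = Doc2Vec(corpus, vector_size=300, window=5, min_count=2, epochs=40)

# Infer a vector for a short query and rank documents by cosine similarity.
query_vec = model.infer_vector(simple_preprocess("requête de cinq mots"))
print(model.dv.most_similar([query_vec], topn=5))
```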


Solution

  • The reason for your poor results is that the queries are simply too short to be embedded meaningfully by Doc2Vec. If you only care about retrieval performance, I would recommend an off-the-shelf information-retrieval tool such as Lucene (see the BM25 sketch after this list).

    If you want to play with neural nets and embeddings, you can do the following:

    • Just use word embeddings, e.g., from fastText. Remove stop words from both the query and the documents, represent each as the average of its word embeddings, and compare them by cosine distance (sketched below).

    • If you don't care about efficiency a lot, you can also try multilingual BERT (available in the Transformers library) or the brand-new French model CamemBERT. In that case you would just take the [CLS] vectors and compute cosine distance on them (sketched below).
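
For the off-the-shelf route, here is a minimal lexical-retrieval sketch. It uses the rank_bm25 package as a lightweight Python stand-in for Lucene (my substitution; Lucene itself is a Java library, and BM25 is its default scoring function):

```python
from rank_bm25 import BM25Okapi

docs = ["premier document ...", "deuxième document ..."]  # placeholder corpus
tokenized = [d.lower().split() for d in docs]

bm25 = BM25Okapi(tokenized)
query = "requête de recherche".lower().split()

# One BM25 relevance score per document; higher is better.
scores = bm25.get_scores(query)
best = max(range(len(docs)), key=lambda i: scores[i])
print(docs[best])
```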
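
For the averaged-word-embedding option, a sketch assuming gensim's load_facebook_vectors and the public pretrained French model cc.fr.300.bin from fasttext.cc; the stop-word list here is a toy placeholder (use a real French list, e.g., from NLTK or spaCy):

```python
import numpy as np
from gensim.models.fasttext import load_facebook_vectors

kv = load_facebook_vectors("cc.fr.300.bin")  # pretrained French fastText vectors
stop_words = {"le", "la", "les", "de", "des", "un", "une", "et"}  # toy list

def embed(text):
    # Average the fastText vectors of the non-stop-word tokens.
    words = [w for w in text.lower().split() if w not in stop_words]
    return np.mean([kv[w] for w in words], axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(embed("requête de recherche"), embed("texte du document ...")))
```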
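
And for the BERT option, a sketch using camembert-base from the Transformers model hub. CamemBERT is RoBERTa-style, so its first token is `<s>` rather than a literal [CLS]; the code below takes the first token's vector either way. Inputs longer than 512 tokens are truncated, which matters for 3000-word documents:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("camembert-base")
model = AutoModel.from_pretrained("camembert-base")

def cls_vector(text):
    # Encode, truncate at the model's 512-token limit, and return the
    # hidden state of the first ([CLS]-style) token.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        out = model(**inputs)
    return out.last_hidden_state[0, 0]

q, d = cls_vector("requête de recherche"), cls_vector("texte du document ...")
print(torch.cosine_similarity(q.unsqueeze(0), d.unsqueeze(0)).item())
```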