I am trying to implement a semantic search to retrieve similar documents from a dataset of unstructured French documents.
I am not getting good results. Please suggest some strategies for doing the semantic search. I was trying to reduce the number of words in my dataset with RAKE keyword extraction.
The reason for your poor results is that the queries are just too short to be embedded well by doc2vec. If you only care about performance, I would recommend using an off-the-shelf information retrieval tool such as Lucene.
If you want to play with neural nets and embeddings, you can do the following:
Just use word embeddings, e.g., from fastText. Remove stop words from both the query and the documents, represent each text as the average of its word embeddings, and compare them by cosine distance.
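A minimal sketch of this approach, assuming pretrained French fastText vectors (`cc.fr.300.bin` from fasttext.cc) loaded with gensim; the stop-word list and example texts here are just placeholders:

```python
import numpy as np
from gensim.models.fasttext import load_facebook_vectors

# Pretrained French vectors downloaded from https://fasttext.cc
vectors = load_facebook_vectors("cc.fr.300.bin")

# Placeholder list -- substitute a proper French stop-word list
stop_words = {"le", "la", "les", "de", "des", "un", "une", "et"}

def embed(text):
    """Average the fastText vectors of the non-stop-word tokens."""
    tokens = [t for t in text.lower().split() if t not in stop_words]
    if not tokens:
        return np.zeros(vectors.vector_size)
    return np.mean([vectors[t] for t in tokens], axis=0)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

query_vec = embed("recherche sémantique de documents")
doc_vec = embed("un document sur la recherche d'information")
print(cosine_similarity(query_vec, doc_vec))
```

fastText handles out-of-vocabulary words via subword n-grams, which helps with the inflected forms common in French.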
If you don't care much about efficiency, you can also try multilingual BERT (available in the Transformers library) or the brand-new French model called CamemBERT. In this case, you would just take the [CLS] vectors and compute the cosine distance on them.
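A rough sketch, assuming the `camembert-base` checkpoint from the Hugging Face hub; keep in mind the raw [CLS] vector isn't fine-tuned for similarity, so treat this as a starting point:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("camembert-base")
model = AutoModel.from_pretrained("camembert-base")

def cls_vector(text):
    """Return the [CLS]-position hidden state as a sentence vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # First token of the last hidden layer (the [CLS]/<s> position)
    return outputs.last_hidden_state[0, 0]

q = cls_vector("recherche sémantique de documents")
d = cls_vector("un document sur la recherche d'information")
print(torch.cosine_similarity(q, d, dim=0).item())
```

For a whole dataset you would precompute and cache the document vectors once, then only embed each incoming query at search time.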