I am trying to implement a RAG architecture in AWS with documents that are in Spanish.
My question is the following: does it matter whether I generate the document embeddings with a model trained on English data or on multilingual data? Or do I have to generate the embeddings with a model trained specifically on Spanish?
I am currently using the GPT-J-6B model to generate the embeddings and the Falcon-40B model to generate the response (inference), but the similarity search does not return good results.
The other question I have is: is it good practice to use the same model both to generate the embeddings and to generate the inference?
GPT-J-6B was trained on The Pile, which is mainly English. The exception is the EuroParl subset, which contains Spanish, but probably not from the same domain as your text. This makes GPT-J-6B a poor choice for generating embeddings for Spanish text.
You should use a model trained on Spanish data, either Spanish-only or multilingual. Of course, the more the domain of the training data differs from yours, the worse the matches you will get.
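One way to check whether a candidate embedding model works for your Spanish documents is to measure retrieval quality on a few (question, expected document) pairs before committing to it. The following is only a minimal sketch under stated assumptions: `toy_trigram_embedder` is a stand-in I made up so the example runs without downloading anything (it is not a real embedding model), and `DOCS` and `PAIRS` are invented sample data. In practice you would pass a function that calls your actual Spanish or multilingual model as `embed`.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Dict[str, float]
Embedder = Callable[[str], Vector]


def toy_trigram_embedder(text: str) -> Vector:
    """Hypothetical stand-in: character-trigram counts, NOT a real model."""
    padded = f"  {text.lower()}  "
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    return {gram: float(n) for gram, n in counts.items()}


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, docs: Sequence[str], embed: Embedder, k: int = 1) -> List[int]:
    """Return the indices of the k documents most similar to the query."""
    q = embed(query)
    scored = [(cosine(q, embed(doc)), i) for i, doc in enumerate(docs)]
    scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, ties by index
    return [i for _, i in scored[:k]]


def recall_at_k(pairs: Sequence[Tuple[str, int]], docs: Sequence[str],
                embed: Embedder, k: int = 1) -> float:
    """Fraction of queries whose expected document appears in the top k."""
    hits = sum(1 for query, expected in pairs
               if expected in top_k(query, docs, embed, k))
    return hits / len(pairs)


# Invented sample data, only for illustration.
DOCS = [
    "La factura se envía por correo electrónico cada mes.",
    "El contrato de alquiler tiene una duración de un año.",
    "La garantía del producto cubre defectos de fabricación.",
]
PAIRS = [
    ("¿Cuándo se envía la factura?", 0),
    ("¿Qué duración tiene el contrato de alquiler?", 1),
    ("¿Qué cubre la garantía del producto?", 2),
]

print("recall@1:", recall_at_k(PAIRS, DOCS, toy_trigram_embedder, k=1))
```

Running the same `PAIRS` through each candidate embedder gives you a number to compare, which is more reliable than judging a handful of searches by eye. The pairs should come from your own documents, since a model that does well on generic Spanish may still do poorly on your domain.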
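As for using the same model both to generate the embeddings and to generate the response, it should not matter. They are applied to different parts of the architecture: the embedding model is only used to retrieve documents, and the generation model only sees the retrieved text.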
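To illustrate why the two models are independent, here is a rough sketch of a RAG pipeline in which the embedder and the generator are just two separate functions that can be swapped without touching each other. Everything in it is hypothetical: `RagPipeline`, `toy_keyword_embedder` and `stub_generator` are names I made up, and the stubs stand in for your real embedding endpoint and your real LLM endpoint.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Embedder = Callable[[str], Sequence[float]]
Generator = Callable[[str], str]

# Hypothetical vocabulary for the toy embedder below.
KEYWORDS = ("factura", "contrato", "garantía")


def toy_keyword_embedder(text: str) -> List[float]:
    """Stand-in embedder: keyword counts, NOT a real embedding model."""
    lowered = text.lower()
    return [float(lowered.count(word)) for word in KEYWORDS]


def stub_generator(prompt: str) -> str:
    """Stand-in for the LLM: it only echoes the prompt it received."""
    return "[stub LLM] " + prompt


@dataclass
class RagPipeline:
    embed: Embedder      # used only for retrieval
    generate: Generator  # used only to write the final answer
    docs: List[str]

    def retrieve(self, question: str, k: int = 1) -> List[str]:
        """Rank documents by dot product with the question embedding."""
        q = self.embed(question)
        scored = []
        for i, doc in enumerate(self.docs):
            score = sum(x * y for x, y in zip(q, self.embed(doc)))
            scored.append((score, i))
        scored.sort(key=lambda s: (-s[0], s[1]))
        return [self.docs[i] for _, i in scored[:k]]

    def answer(self, question: str, k: int = 1) -> str:
        """Build a prompt from the retrieved text and pass it to the generator."""
        context = "\n".join(self.retrieve(question, k))
        prompt = f"Contexto:\n{context}\n\nPregunta: {question}\nRespuesta:"
        return self.generate(prompt)


pipeline = RagPipeline(
    embed=toy_keyword_embedder,
    generate=stub_generator,
    docs=[
        "La factura se envía por correo electrónico cada mes.",
        "El contrato de alquiler tiene una duración de un año.",
    ],
)
print(pipeline.answer("¿Cuándo se envía la factura?"))
```

The generator never sees any embedding, only the retrieved text placed in the prompt. That is why you can replace the embedder with a Spanish or multilingual model and keep Falcon-40B for generation, or the other way around.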