I have built a web interface for chatting with a single PDF document, using LangChain as the framework, OpenAI as the LLM, and Pinecone as the vector store. However, when I went to add five new PDF documents to the vector store, I realized that the information in them is different from that in the first document.
I have thought about putting the embeddings of all the PDF documents into Pinecone, but I am not sure whether information from different documents could get mixed up when a question concerns only one PDF.
So I'm thinking that another way could be to add selectors to the web interface so that the user can choose which PDF they want answers from, and the query is then directed to that specific PDF. But that would make the user's interaction with the web interface less automatic.
This is why I want to find a way to send all the PDF documents to Pinecone and, perhaps in the vector store itself, add an index for each document or add more collections. I would appreciate it if anyone who has worked on something similar could give me advice on how to continue.
If your goal is to ensure that a query about a specific PDF document (e.g., "D", as you mentioned in your comment) returns information only from that document, without interference from the content of the other documents (A, B, C, E), you should store and query the embeddings for each document separately.
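As a rough illustration of keeping each document's embeddings separate, here is a minimal in-memory sketch. It does not call Pinecone; the `Chunk` class, the `search` function, the file names, and the two-dimensional vectors are all made up for the example. In Pinecone the same idea is usually expressed with one namespace per document, or with a metadata field on each vector plus a `filter` on the query.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Chunk:
    doc_id: str          # which PDF this chunk came from (hypothetical label)
    text: str            # the chunk's text
    vector: List[float]  # its embedding (toy 2-D values here)


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(store: List[Chunk], query_vector: List[float],
           top_k: int = 3, doc_id: Optional[str] = None) -> List[Chunk]:
    """Return the top_k most similar chunks, optionally limited to one document."""
    candidates = [c for c in store if doc_id is None or c.doc_id == doc_id]
    ranked = sorted(candidates,
                    key=lambda c: cosine(query_vector, c.vector),
                    reverse=True)
    return ranked[:top_k]


# Toy data standing in for real embeddings of PDF chunks.
store = [
    Chunk("A.pdf", "refund policy", [1.0, 0.0]),
    Chunk("D.pdf", "shipping times", [0.9, 0.1]),
    Chunk("D.pdf", "warranty terms", [0.0, 1.0]),
]

# Restricting the search to D.pdf keeps chunks from A.pdf out of the results.
only_d = search(store, [1.0, 0.0], doc_id="D.pdf")
# Leaving doc_id unset searches across every document.
everything = search(store, [1.0, 0.0])
```

With a per-document label in place, the web interface can either pass the selected PDF as the filter or leave it unset to search everything.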
Combining all documents into one loses the granularity of individual documents. You won't be able to retrieve or analyze information at the document level, because every query will run against the entire combined content. The combined store also grows large, and a large collection of high-dimensional vectors can be demanding in terms of computational resources and storage.
It is also difficult to update the combined documents. For example, if document A gets updated, you need to combine all the documents again and then store them again. This will be a very expensive operation.
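By contrast, when every stored chunk records which document it came from, an update only touches that document's chunks. The sketch below is again in-memory and hypothetical (`replace_document`, the dictionary layout, and the sample data are not from any library). With Pinecone the equivalent would be deleting that document's vectors, for instance by namespace or by ID, and upserting the new ones.

```python
from typing import Dict, List, Tuple


def replace_document(store: List[Dict], doc_id: str,
                     new_chunks: List[Tuple[str, List[float]]]) -> List[Dict]:
    """Drop every stored chunk for doc_id, then add the freshly embedded ones.

    Chunks belonging to other documents are kept as they are.
    """
    kept = [c for c in store if c["doc_id"] != doc_id]
    added = [{"doc_id": doc_id, "text": text, "vector": vector}
             for text, vector in new_chunks]
    return kept + added


# Toy store with chunks from two documents.
store = [
    {"doc_id": "A.pdf", "text": "old intro", "vector": [0.1, 0.9]},
    {"doc_id": "A.pdf", "text": "old summary", "vector": [0.2, 0.8]},
    {"doc_id": "B.pdf", "text": "pricing table", "vector": [0.7, 0.3]},
]

# Document A changed: only its chunks are re-embedded and replaced.
updated = replace_document(store, "A.pdf", [("new intro", [0.3, 0.7])])
```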
The benefit of combining all the documents is that you can run complex queries that draw on several documents at once.