I was able to successfully load and process my Confluence data at scale, with something like:
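(Paraphrased from my notebook; the Confluence URL, credentials, and space key are placeholders, and the tiny splitter numbers stand in for the sample values I had copied.)

from langchain.document_loaders import ConfluenceLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Pull the pages from a Confluence space
loader = ConfluenceLoader(
    url="https://yoursite.atlassian.net/wiki",
    username="me@example.com",
    api_key="<api-key>",
)
docs = loader.load(space_key="<space-key>", limit=50)

# Split the pages into chunks for embedding
# (chunk_size copied from a sample -- this turns out to matter, see below)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=26, chunk_overlap=4)
splits = text_splitter.split_documents(docs)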
However, when I tried to persist it in a vector DB with something like:
from langchain.vectorstores import Chroma

# Embed the chunks and persist them to a local Chroma store
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory,
)
it ran for a couple of hours on my modest laptop before eventually throwing an APIConnectionError: Connection error exception.
Is it some kind of timeout? If so, how do I get around it?
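If it is a timeout, would raising the client timeout and retry limits help? I am thinking of something along these lines (untested; request_timeout and max_retries are parameters on LangChain's OpenAIEmbeddings):

from langchain.embeddings import OpenAIEmbeddings

# Give the embedding client a longer per-request timeout and more retries
embedding = OpenAIEmbeddings(
    request_timeout=120,  # seconds allowed per API request
    max_retries=10,       # retry transient connection failures
)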
Any ideas?
Stack Overflow does not allow me to share the complete call stack here since it is very large, but I have posted it at https://community.deeplearning.ai/t/apiconnectionerror-connection-error-exception-persisting-confluence-data-to-vectordb/544670 in case it helps narrow down the issue.
You can find my complete code at https://github.com/sameermahajan/GenAI/blob/main/LangChain/Chat-With-Your-Data/chat_azure.ipynb, in which you need to replace the PDF loader part with https://github.com/sameermahajan/GenAI/blob/main/LangChain/Chat-With-Your-Data/confluence.py for Confluence.
This was actually due to a low chunk size. When I increased it to 1000, it worked fine. I had overlooked the number when copy-pasting the code from a sample!
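In other words, the fix is just the splitter configuration (a sketch; chunk_size=1000 is the change that mattered, the chunk_overlap value is my own choice):

from langchain.text_splitter import RecursiveCharacterTextSplitter

# A tiny chunk size produces a huge number of chunks (and embedding API
# calls); chunk_size=1000 keeps that manageable, which is presumably why
# the connection error goes away.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
splits = text_splitter.split_documents(docs)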