Tags: langchain, chromadb

APIConnectionError: Connection error exception persisting Confluence data to ChromaDB


I could successfully load and process my Confluence data at the following scale (my loading code is sketched below the list):

  • 868 documents
  • 1 million splits
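
For context, the loading part of my confluence.py looks roughly like the following (a minimal sketch, assuming LangChain's ConfluenceLoader; the URL, credentials and space key are placeholders):

from langchain.document_loaders import ConfluenceLoader

# Placeholder connection details for the Confluence instance being indexed.
loader = ConfluenceLoader(
    url="https://yourcompany.atlassian.net/wiki",
    username="you@example.com",
    api_key="YOUR_API_TOKEN",
)

# Pull the pages from one space; limit is the number of pages fetched per request.
docs = loader.load(space_key="SPACE", limit=50)

The splits mentioned above come from running these documents through a text splitter, which is where the chunk size discussed in the solution below comes in.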

However, when I tried to persist it in the vector DB with something like:

from langchain.vectorstores import Chroma

# splits and embedding are created earlier in the notebook
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory
)

it ran for a couple of hours on my modest laptop and eventually threw an APIConnectionError: Connection error exception.

Is it some kind of timeout? If so, how do I get around it?
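
For instance, would raising the client-side timeout and retry count on the embeddings object be the right way around it? Something like this (just a guess on my part; my notebook uses an Azure OpenAI embedding deployment, and the names and values here are placeholders):

from langchain.embeddings import OpenAIEmbeddings

# Placeholder deployment name; the idea is simply to allow each embedding
# request more time and to retry transient connection failures.
embedding = OpenAIEmbeddings(
    deployment="my-embedding-deployment",
    request_timeout=120,  # seconds per request
    max_retries=10,       # retries on transient errors before giving up
)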

Any ideas?

Stack Overflow does not allow me to share the complete call stack here since it is very large, but I have posted it at https://community.deeplearning.ai/t/apiconnectionerror-connection-error-exception-persisting-confluence-data-to-vectordb/544670 in case it helps narrow down the issue.

You can find my complete code at https://github.com/sameermahajan/GenAI/blob/main/LangChain/Chat-With-Your-Data/chat_azure.ipynb, in which you need to replace the PDF loader part with https://github.com/sameermahajan/GenAI/blob/main/LangChain/Chat-With-Your-Data/confluence.py for Confluence.


Solution

  • This was actually due to a low chunk size. When I increased it to 1000 it worked fine (splitter configuration sketched below). I had overlooked the number when copy-pasting the code from a sample!
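
For reference, the splitter configuration that works for me now looks like this (a minimal sketch, assuming a RecursiveCharacterTextSplitter; the overlap value is just the one I picked):

from langchain.text_splitter import RecursiveCharacterTextSplitter

# The sample I copied used a tiny chunk_size, which blew the corpus up into
# ~1 million splits; raising it to 1000 keeps the number of embedding calls sane.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
)
splits = text_splitter.split_documents(docs)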