Tags: python, langchain, chromadb

ChromaDb add single document, only if it doesn't exist


I'm working with langchain and ChromaDb using python.

Now, I know how to use document loaders. For instance, the code below loads a bunch of documents into ChromaDb:

from langchain.embeddings.openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

from langchain.vectorstores import Chroma
db = Chroma.from_documents(docs, embeddings, persist_directory='db')
db.persist()

But what if I wanted to add a single document at a time? More specifically, I want to check if a document exists before I add it. This is so I don't keep adding duplicates.

If a document does not exist, only then do I want to get embeddings and add it.

How do I do this using langchain? I think I mostly understand langchain but have no idea how to do seemingly basic tasks like this.


Solution

  • Filter based solely on the Document's Content

    Here is a filtering mechanism that deduplicates documents by their content. It uses a list comprehension that relies on how Python's and/or operators short-circuit:

    import uuid

    # Create a deterministic id for each document based on its content
    ids = [str(uuid.uuid5(uuid.NAMESPACE_DNS, doc.page_content)) for doc in docs]
    # Deduplicate the ids while preserving first-occurrence order, so that
    # unique_ids[i] matches unique_docs[i] (list(set(ids)) would not keep the order)
    unique_ids = list(dict.fromkeys(ids))
    
    # Keep only the first document seen for each id
    seen_ids = set()
    unique_docs = [doc for doc, id in zip(docs, ids) if id not in seen_ids and (seen_ids.add(id) or True)]
    
    # Add the unique documents to your database
    db = Chroma.from_documents(unique_docs, embeddings, ids=unique_ids, persist_directory='db')
    

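    The ids list comprehension derives a UUID for each document with the uuid.uuid5() function, which builds the UUID from the SHA-1 hash of a namespace identifier and a name string (in this case, the content of the document). The result is deterministic: documents with identical content always get the same UUID, which is what makes duplicates detectable.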
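To illustrate why content-derived IDs are suitable for deduplication, here is a small sketch showing that uuid.uuid5() maps identical strings to the same UUID and different strings to different ones. The helper name content_id and the sample strings are made up for this illustration.

```python
import uuid


def content_id(text: str) -> str:
    """Return a deterministic UUID string derived from the text (hypothetical helper)."""
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, text))


id_a = content_id("Chroma is a vector store.")
id_b = content_id("Chroma is a vector store.")  # same content as id_a
id_c = content_id("LangChain wraps vector stores.")

print(id_a == id_b)  # True: same content, same id
print(id_a == id_c)  # False: different content, different id
```

Because the ID depends only on the text, it stays the same across separate runs of the program.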

    The if condition in the list comprehension checks whether the ID of the current document exists in the seen_ids set:

    • If it doesn't exist, this implies the document is unique. It gets added to seen_ids using seen_ids.add(id), and the document gets included in unique_docs.
    • If it does exist, the document is a duplicate and gets ignored.

    The or True at the end is necessary to always return a truthy value to the if condition, because seen_ids.add(id) returns None (which is falsy) even when an element is successfully added.
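If the one-line comprehension feels too clever, the same filtering can be written as an explicit loop. The sketch below is one possible equivalent, not part of the original answer. Doc is a minimal stand-in for a langchain Document (only page_content is used), and dedupe_by_content is a hypothetical helper name.

```python
import uuid
from dataclasses import dataclass


@dataclass
class Doc:
    """Minimal stand-in for a langchain Document (assumption for this sketch)."""
    page_content: str


def dedupe_by_content(docs):
    """Return (unique_docs, unique_ids), keeping the first document per content-derived id."""
    seen_ids = set()
    unique_docs, unique_ids = [], []
    for doc in docs:
        doc_id = str(uuid.uuid5(uuid.NAMESPACE_DNS, doc.page_content))
        if doc_id in seen_ids:
            continue  # duplicate content: skip it
        seen_ids.add(doc_id)
        unique_docs.append(doc)
        unique_ids.append(doc_id)
    return unique_docs, unique_ids


docs = [Doc("alpha"), Doc("beta"), Doc("alpha")]
unique_docs, unique_ids = dedupe_by_content(docs)
print([d.page_content for d in unique_docs])  # ['alpha', 'beta']
```

Building both lists in the same loop guarantees that each ID stays paired with its document.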

    Unlike IDs generated from URLs or other document metadata, content-derived IDs catch duplicates even when the same text arrives from different sources, so no manual checks are needed. And because uuid5 is deterministic, the same content maps to the same ID on every run.
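The same content-derived ID can also be used to check whether a single document is already stored before adding it, which is what the question asks for. The sketch below assumes the langchain Chroma wrapper exposes get(ids=...) returning a dict with an "ids" key, and add_documents(..., ids=...); check both against your installed version. FakeStore and Doc are hypothetical in-memory stand-ins so the example runs without a database, and add_if_missing is an illustrative helper name.

```python
import uuid
from dataclasses import dataclass


@dataclass
class Doc:
    """Minimal stand-in for a langchain Document (assumption for this sketch)."""
    page_content: str


class FakeStore:
    """Hypothetical in-memory store mimicking get(ids=...) and add_documents(..., ids=...)."""

    def __init__(self):
        self._docs = {}

    def get(self, ids=None):
        found = [i for i in (ids or []) if i in self._docs]
        return {"ids": found}

    def add_documents(self, documents, ids=None):
        for doc, doc_id in zip(documents, ids):
            self._docs[doc_id] = doc
        return list(ids)


def add_if_missing(db, doc) -> bool:
    """Add doc only if its content-derived id is not in the store; return True if added."""
    doc_id = str(uuid.uuid5(uuid.NAMESPACE_DNS, doc.page_content))
    if db.get(ids=[doc_id])["ids"]:
        return False  # already stored: skip it
    db.add_documents([doc], ids=[doc_id])
    return True


store = FakeStore()
print(add_if_missing(store, Doc("alpha")))  # True: first time, document is added
print(add_if_missing(store, Doc("alpha")))  # False: same content, skipped
```

With a real vector store in place of FakeStore, the lookup happens before the add call, so a duplicate should not trigger a new embedding request.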