I'm working with langchain and ChromaDb using python.
Now, I know how to use document loaders. For instance, the below loads a bunch of documents into ChromaDb:
from langchain.embeddings.openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
from langchain.vectorstores import Chroma
db = Chroma.from_documents(docs, embeddings, persist_directory='db')
db.persist()
But what if I wanted to add a single document at a time? More specifically, I want to check if a document exists before I add it. This is so I don't keep adding duplicates.
If a document does not exist, only then do I want to get embeddings and add it.
How do I do this using langchain? I think I mostly understand langchain but have no idea how to do seemingly basic tasks like this.
Here is an alternative filtering mechanism that uses a list comprehension trick exploiting how the or operator evaluates truthiness in Python:
import uuid

# Create an id for each document, derived deterministically from its content
ids = [str(uuid.uuid5(uuid.NAMESPACE_DNS, doc.page_content)) for doc in docs]
# Keep one id per distinct content, in first-seen order, so the ids line up with unique_docs
# (list(set(ids)) does not preserve order and would pair ids with the wrong documents)
unique_ids = list(dict.fromkeys(ids))
# Keep only the first document for each id
seen_ids = set()
unique_docs = [doc for doc, id in zip(docs, ids) if id not in seen_ids and (seen_ids.add(id) or True)]
# Add the unique documents to your database
db = Chroma.from_documents(unique_docs, embeddings, ids=unique_ids, persist_directory='db')
First, a UUID is generated for each document with the uuid.uuid5() function, which creates a UUID from the SHA-1 hash of a namespace identifier and a name string (in this case, the content of the document). Because the result depends only on its inputs, documents with identical content always get the same ID.
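As a small illustration of that determinism, the sketch below hashes a few strings with uuid.uuid5(). The helper name content_id is made up for this example and is not part of langchain or Chroma.

```python
import uuid


def content_id(text: str) -> str:
    """Hypothetical helper: derive a deterministic ID from a document's text."""
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, text))


a = content_id("hello world")
b = content_id("hello world")   # same text as a
c = content_id("hello world!")  # one character different

print(a == b)  # identical content gives an identical ID
print(a == c)  # different content gives a different ID
```

Note that any change to the text, including whitespace, produces a different ID, so this only catches exact duplicates.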
The if condition in the list comprehension checks whether the ID of the current document is already in the seen_ids set. If it is not, the ID is added to seen_ids using seen_ids.add(id), and the document gets included in unique_docs.
The or True at the end is necessary to always return a truthy value to the if condition, because seen_ids.add(id) returns None (which is falsy) even when an element is successfully added.
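If the one-liner feels too clever, the same filtering can be written as a plain loop, which also makes it easy to keep each ID next to its document. This is only a sketch: Doc is a hypothetical stand-in for langchain's Document (only page_content is used), and dedupe_by_content is a name invented here.

```python
import uuid
from dataclasses import dataclass


@dataclass
class Doc:
    # Hypothetical stand-in for langchain's Document; only page_content matters here.
    page_content: str


def dedupe_by_content(docs):
    """Return (unique_docs, unique_ids), keeping the first document per content hash."""
    seen_ids = set()
    unique_docs, unique_ids = [], []
    for doc in docs:
        doc_id = str(uuid.uuid5(uuid.NAMESPACE_DNS, doc.page_content))
        if doc_id in seen_ids:
            continue  # same content already kept, skip this duplicate
        seen_ids.add(doc_id)
        unique_docs.append(doc)
        unique_ids.append(doc_id)
    return unique_docs, unique_ids


docs = [Doc("a"), Doc("b"), Doc("a")]
unique_docs, unique_ids = dedupe_by_content(docs)
print([d.page_content for d in unique_docs])
```

Since both lists are built in the same pass, unique_ids[i] always belongs to unique_docs[i].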
This approach is more practical than generating IDs using URLs or other document metadata, as it directly prevents the addition of duplicate documents based on content rather than relying on metadata or manual checks.
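The same content-based IDs can also be used to check what is already stored before adding anything, which is what the question asks for. The sketch below is hedged: it assumes the store exposes get(ids=...) returning a dict with an "ids" key and add_documents(documents, ids=...), which is my understanding of langchain's Chroma wrapper in recent versions, so check it against the version you have installed. FakeStore, Doc, and add_if_missing are hypothetical names used only to make the example runnable without a database.

```python
import uuid
from dataclasses import dataclass


@dataclass
class Doc:
    # Hypothetical stand-in for langchain's Document.
    page_content: str


class FakeStore:
    """In-memory stand-in exposing the two methods assumed of the vector store."""

    def __init__(self):
        self._docs = {}

    def get(self, ids=None):
        # Assumed shape: a dict whose "ids" key lists the requested ids that exist.
        return {"ids": [i for i in (ids or []) if i in self._docs]}

    def add_documents(self, documents, ids=None):
        for doc, doc_id in zip(documents, ids):
            self._docs[doc_id] = doc
        return list(ids)


def add_if_missing(store, docs):
    """Add only documents whose content hash is not already in the store."""
    # Deduplicate within the batch first, keeping the first doc per id.
    candidates = {}
    for doc in docs:
        doc_id = str(uuid.uuid5(uuid.NAMESPACE_DNS, doc.page_content))
        candidates.setdefault(doc_id, doc)
    if not candidates:
        return []
    # Ask the store which of these ids it already holds.
    existing = set(store.get(ids=list(candidates))["ids"])
    new_ids = [i for i in candidates if i not in existing]
    if new_ids:
        # Embeddings would only be computed for these documents.
        store.add_documents([candidates[i] for i in new_ids], ids=new_ids)
    return new_ids


store = FakeStore()
first = add_if_missing(store, [Doc("a"), Doc("b"), Doc("a")])
second = add_if_missing(store, [Doc("b"), Doc("c")])
print(len(first), len(second))
```

With a real store you would pass your Chroma instance in place of FakeStore, provided its get and add_documents methods behave as assumed above.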