I'm wondering how people deal with IDs in Chroma DB. I plan to store code snippets (say, single functions or classes) in a collection and need a unique ID for each. These documents are going to be generated, so the first problem is: how do I go about randomly generating an appropriate ID?
I suppose I may want to update a document at some point, so I'd need the ID handy. This feels like a chicken-and-egg problem. Am I supposed to store the IDs in another DB like Postgres? And then how would I even know which ID relates to which snippet? Query ChromaDB first to find the ID of the most related document?
If you are going to reference the vector DB by ID again to find a specific entry, that tells me you have the entry IDs stored somewhere else. That being the case, I'd recommend using a combination of a regular database table and the vector DB table. You could then use the auto-generated ID from the regular table as the ID of the matching entry in the vector DB.
You could do something like the following:
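1. Regular Database Table (Table A): insert the code snippet and let the database auto-generate its primary key.
2. Retrieve Document IDs: read the newly generated ID back from the inserted row.
3. Chroma DB Table (Table B): add the snippet to the Chroma collection, using that ID (as a string) as its Chroma ID.
4. Done! The same ID now identifies the snippet in both places.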
Here's a simplified example using Python and a database library (SQLAlchemy here, for SQL databases):
# Step 1: Insert data into the regular database (Table A)
# Assuming you have a SQLAlchemy model called CodeSnippet
from chromadb.utils import embedding_functions
from sqlalchemy import create_engine, Column, Integer, String
# declarative_base is importable from sqlalchemy.orm as of SQLAlchemy 1.4
from sqlalchemy.orm import declarative_base, sessionmaker
import chromadb
Base = declarative_base()
class CodeSnippet(Base):
    __tablename__ = 'CodeSnippets'
    id = Column(Integer, primary_key=True, autoincrement=True)
    code = Column(String)
    # Add other metadata columns as needed
engine = create_engine('sqlite:///database.db')
Base.metadata.create_all(engine)
# Create a session
Session = sessionmaker(bind=engine)
session = Session()
# Insert a code snippet into Table A
new_snippet = CodeSnippet(code='print("Hello World")')
session.add(new_snippet)
session.commit()
# Step 2: Retrieve the newly generated document ID
document_id = str(new_snippet.id)
# Step 3: Add embeddings to Chroma DB (Table B)
client = chromadb.PersistentClient(path="./data")
sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
collection = client.get_or_create_collection("code_snippets", embedding_function=sentence_transformer_ef)
# Chroma IDs are strings, so the integer primary key is passed in its str form
collection.add(ids=[document_id], documents=[new_snippet.code])
# Step 4: You can now easily reference the document by its ID in both databases
# For example, you can retrieve the code snippet from Table A by its ID
# The primary key column is an Integer, so convert the string ID back before comparing
result = session.query(CodeSnippet).filter(CodeSnippet.id == int(document_id)).first()
print(result.code)
# Or you can retrieve the code snippet from Table B by its ID
result = collection.get(ids=[document_id])
print(result)
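On the "how would I even know which ID relates to which snippet?" part: one option is to keep some natural key for each snippet in Table A and look the ID up from that. Below is a minimal sketch of that idea using only the standard-library sqlite3 module. The `path` and `symbol` columns and the `save_snippet` helper are hypothetical names chosen for illustration, not anything Chroma or SQLAlchemy requires.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: (path, symbol) acts as a natural key for a snippet.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS code_snippets (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            path TEXT NOT NULL,
            symbol TEXT NOT NULL,
            code TEXT NOT NULL,
            UNIQUE (path, symbol)
        )
        """
    )


def save_snippet(conn: sqlite3.Connection, path: str, symbol: str, code: str) -> str:
    """Insert a new snippet or update an existing one; return its ID as a string."""
    row = conn.execute(
        "SELECT id FROM code_snippets WHERE path = ? AND symbol = ?",
        (path, symbol),
    ).fetchone()
    if row is None:
        cur = conn.execute(
            "INSERT INTO code_snippets (path, symbol, code) VALUES (?, ?, ?)",
            (path, symbol, code),
        )
        snippet_id = cur.lastrowid
    else:
        snippet_id = row[0]
        conn.execute(
            "UPDATE code_snippets SET code = ? WHERE id = ?",
            (code, snippet_id),
        )
    conn.commit()
    # Chroma expects string IDs, so return the key in that form.
    return str(snippet_id)


conn = sqlite3.connect(":memory:")
create_schema(conn)
first_id = save_snippet(conn, "utils/io.py", "read_file", "def read_file(p): ...")
same_id = save_snippet(conn, "utils/io.py", "read_file", "def read_file(path): ...")
other_id = save_snippet(conn, "utils/io.py", "write_file", "def write_file(p, s): ...")
```

Saving the same (path, symbol) pair twice returns the same ID, so you never have to query Chroma just to find out what to update. You would pass the returned string as the ID to `collection.add` for a new snippet or `collection.update` for an existing one; those Chroma calls are left out of the sketch so it runs without any extra dependencies.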
This approach ensures you have a clear mapping between your document data and embeddings, avoiding the chicken-and-egg problem you mentioned.
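As for the first part of the question (generating an ID at all): if a second database is more than you need, Python's standard-library uuid module may be enough. This is only a sketch under the assumption that each snippet can be described by a file path and a symbol name; `random_id` and `stable_id` are hypothetical helper names.

```python
import uuid


def random_id() -> str:
    # A random UUID: unique in practice, but you must store it somewhere
    # if you want to find the same document again later.
    return str(uuid.uuid4())


def stable_id(path: str, symbol: str) -> str:
    # A name-based UUID: the same (path, symbol) pair always produces the
    # same ID, so it can be recomputed instead of stored.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{path}#{symbol}"))


snippet_id = stable_id("utils/io.py", "read_file")
```

The trade-off is that a random ID has to be remembered somewhere, while a name-based ID changes if the snippet is moved or renamed, so which one fits depends on how your snippets are generated.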