Generating and using ChromaDB ids

I'm wondering how people deal with ids in ChromaDB. I plan to store code snippets (say, single functions or classes) in a collection and need a unique id for each. These documents are going to be generated, so the first problem is: how do I go about randomly generating an appropriate id?

I suppose I may want to update a document at some point, so I'd need the id handy. This feels like a chicken-and-egg problem. Am I supposed to store the ids in another db like Postgres? And then how would I even know which id relates to which snippet? Query ChromaDB first to find the id of the most related document?

Solution

If you are going to reference the vector DB again by ID to find a specific entry, that tells me you have the entry IDs stored somewhere else. That being the case, I'd recommend combining a regular database table with the vector DB collection. You can then use the auto-generated ID from the table as the ID of the corresponding entry in the vector DB.

You could do something like the following:

1. Regular Database Table (Table A):
- Insert your code snippets or documents into a regular database table. Let's call this table "CodeSnippets."
- This table should have an auto-generated primary key (e.g., an incrementing integer or a UUID) to ensure each document has a unique ID.
2. Retrieve Document IDs:
- After inserting a document into "CodeSnippets," retrieve the newly generated unique ID for that document.
3. Chroma DB Collection (Table B):
- At the same time, add the document to a Chroma DB collection under the ID from step 2. Let's call this collection "Embeddings." Chroma IDs are strings, so convert an integer key with str() first.
- Each entry in "Embeddings" then holds the document ID (from Table A) alongside the document's embedding.
4. Done!
- You now have a system where you can easily reference your documents by their unique IDs, both in your regular database and Chroma DB.
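
If you'd rather generate the IDs yourself than rely on an auto-increment column, the UUID option mentioned for Table A might look like the sketch below. It uses only Python's standard uuid module. The helper names and the example key format are assumptions for illustration, not something Chroma requires.

```python
import uuid


def random_id():
    """Return a random (version 4) UUID as a string, usable as a Chroma ID."""
    return str(uuid.uuid4())


def stable_id(key):
    """Return a deterministic (version 5) UUID derived from a stable key.

    `key` is a hypothetical identifier you control, for example
    "utils.py::parse_args". The same key always yields the same ID,
    so the ID can be recomputed instead of looked up.
    """
    return str(uuid.uuid5(uuid.NAMESPACE_URL, key))


if __name__ == "__main__":
    print(random_id())
    print(stable_id("utils.py::parse_args"))
```

A random ID still has to be stored somewhere (such as Table A) if you want to find the entry again. A deterministic ID only helps if the key itself never changes when the snippet's code does.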

Here's a simplified example in Python, using SQLAlchemy with SQLite for the regular database:

# Step 1: Insert data into the regular database (Table A)
# Define a SQLAlchemy model called CodeSnippet
import chromadb
from chromadb.utils import embedding_functions
from sqlalchemy import create_engine, Column, Integer, String
# declarative_base lives in sqlalchemy.orm since SQLAlchemy 1.4
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class CodeSnippet(Base):
    __tablename__ = 'CodeSnippets'

    id = Column(Integer, primary_key=True, autoincrement=True)
    code = Column(String)
    # Add other metadata columns as needed

engine = create_engine('sqlite:///database.db')
Base.metadata.create_all(engine)

# Create a session
Session = sessionmaker(bind=engine)
session = Session()

# Insert a code snippet into Table A
new_snippet = CodeSnippet(code='print("Hello World")')
session.add(new_snippet)
session.commit()

# Step 2: Retrieve the newly generated document ID
document_id = str(new_snippet.id)

# Step 3: Add embeddings to Chroma DB (Table B)
client = chromadb.PersistentClient(path="./data")
sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
collection = client.get_or_create_collection(
    "code_snippets", embedding_function=sentence_transformer_ef
)
# Chroma ids must be strings, hence the str() conversion in step 2
collection.add(ids=[document_id], documents=[new_snippet.code])

# Step 4: You can now easily reference the document by its ID in both databases
# For example, you can retrieve the code snippet from Table A by its ID
# (the primary key is an integer, so convert the string ID back)
result = session.query(CodeSnippet).filter(CodeSnippet.id == int(document_id)).first()
print(result.code)

# Or you can retrieve the code snippet from Table B by its ID
result = collection.get(ids=[document_id])
print(result)

This approach ensures you have a clear mapping between your document data and embeddings, avoiding the chicken-and-egg problem you mentioned.
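
To make the update case from the question concrete, here is a minimal sketch of both directions of the mapping: updating a snippet whose ID you already know, and going from a similarity query back to rows in Table A. It assumes the CodeSnippets table lives in SQLite and uses the standard-library sqlite3 module rather than SQLAlchemy to stay dependency-free. `collection` is assumed to be a Chroma collection like the one above, and the helper names `update_snippet` and `find_snippets` are hypothetical.

```python
import sqlite3


def update_snippet(conn, collection, snippet_id, new_code):
    """Update a snippet in Table A, then refresh its entry in Chroma.

    `conn` is a sqlite3 connection holding the CodeSnippets table.
    `collection` is assumed to be a Chroma collection; only its
    `upsert(ids=..., documents=...)` method is used.
    """
    cur = conn.execute(
        "UPDATE CodeSnippets SET code = ? WHERE id = ?",
        (new_code, snippet_id),
    )
    if cur.rowcount == 0:
        conn.rollback()
        raise KeyError(f"no snippet with id {snippet_id}")
    try:
        # Chroma ids are strings, so convert the integer key.
        collection.upsert(ids=[str(snippet_id)], documents=[new_code])
    except Exception:
        # Keep both stores consistent if the vector update fails.
        conn.rollback()
        raise
    conn.commit()


def find_snippets(conn, collection, query_text, n_results=3):
    """Return (id, code) rows for the snippets most similar to `query_text`."""
    result = collection.query(query_texts=[query_text], n_results=n_results)
    # `ids` holds one inner list per query text; we sent a single query.
    ids = [int(i) for i in result["ids"][0]]
    if not ids:
        return []
    placeholders = ",".join("?" * len(ids))
    rows = conn.execute(
        f"SELECT id, code FROM CodeSnippets WHERE id IN ({placeholders})",
        ids,
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    # Preserve Chroma's ranking order; skip ids missing from the table.
    return [by_id[i] for i in ids if i in by_id]
```

Treating the relational table as the source of truth means you never have to query Chroma just to discover an ID before an update. Chroma is only queried when you actually want to search by similarity.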