Generating and using ChromaDB ids


I'm wondering how people deal with IDs in ChromaDB. I plan to store code snippets (say, single functions or classes) in a collection and need a unique ID for each. These documents are going to be generated, so the first problem is: how do I go about randomly generating an appropriate ID?

I suppose I may want to update a document at some point, so I'd need the ID handy. This feels like a chicken-and-egg problem. Am I supposed to store the IDs in another DB like Postgres? And then how would I even know which ID relates to which snippet? Do I query ChromaDB first to find the ID of the most closely related document?


Solution

  • If you are going to look up a specific entry in the vector DB by ID, you need to have the entry IDs stored somewhere else. That being the case, I'd recommend combining a regular database table with the vector DB collection. You can then use the auto-generated ID from the table as the ID of the corresponding entry in the vector DB.

    You could do something like the following:

    1. Regular Database Table (Table A):

      • Insert your code snippets or documents into a regular database table. Let's call this table "CodeSnippets."
      • This table should have an auto-generated primary key (e.g., an incrementing integer or a UUID) to ensure each document has a unique ID.
    2. Retrieve Document IDs:

      • After inserting a document into "CodeSnippets," retrieve the newly generated unique ID for that document.
    3. Chroma DB Collection (Table B):

      • At the same time, add the document to a Chroma DB collection, using the ID from step 2 as its ID. Let's call this collection "Embeddings."
      • Each entry in "Embeddings" then carries the document ID (from Table A) alongside the document embedding and, optionally, the document text and metadata.
    4. Done!

      • You now have a system where you can easily reference your documents by their unique IDs, both in your regular database and Chroma DB.

    Here's a simplified example using Python, with SQLAlchemy and SQLite standing in for the regular database:

    # Step 1: Insert data into the regular database (Table A)
    # Assuming you have a SQLAlchemy model called CodeSnippet
    from chromadb.utils import embedding_functions
    from sqlalchemy import create_engine, Column, Integer, String
    # declarative_base lives in sqlalchemy.orm as of SQLAlchemy 1.4
    from sqlalchemy.orm import declarative_base, sessionmaker
    import chromadb
    
    Base = declarative_base()
    
    class CodeSnippet(Base):
        __tablename__ = 'CodeSnippets'
    
        id = Column(Integer, primary_key=True, autoincrement=True)
        code = Column(String)
        # Add other metadata columns as needed
    
    engine = create_engine('sqlite:///database.db')
    Base.metadata.create_all(engine)
    
    # Create a session
    Session = sessionmaker(bind=engine)
    session = Session()
    
    # Insert a code snippet into Table A
    new_snippet = CodeSnippet(code='print("Hello World")')
    session.add(new_snippet)
    session.commit()
    
    # Step 2: Retrieve the newly generated document ID
    document_id = str(new_snippet.id)
    
    # Step 3: Add embeddings to Chroma DB (Table B)
    client = chromadb.PersistentClient("./data")
    sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
        model_name="all-MiniLM-L6-v2"
    )
    collection = client.get_or_create_collection("code_snippets", embedding_function=sentence_transformer_ef)
    collection.add(ids=[document_id], documents=[new_snippet.code])
    
    # Step 4: You can now easily reference the document by its ID in both databases
    # For example, you can retrieve the code snippet from Table A by its ID
    # Chroma IDs are strings, so convert back to int for the integer primary key
    result = session.query(CodeSnippet).filter(CodeSnippet.id == int(document_id)).first()
    print(result.code)
    
    # Or you can retrieve the code snippet from Table B by its ID
    result = collection.get(ids=[document_id])
    print(result)
    
    

    This approach ensures you have a clear mapping between your document data and embeddings, avoiding the chicken-and-egg problem you mentioned.
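
On the first part of the question (generating the ID itself), here is a minimal sketch using only the Python standard library. It is not taken from the Chroma docs. `random_id` returns a random UUID, which is unique but has to be stored somewhere (for example, in the table above) to find the entry again. `snippet_id` is a hypothetical helper that derives the ID from a stable key, here an assumed file path plus function or class name, so the same ID can be recomputed later without a lookup. The function names, the key format, and the 32-character truncation are all illustrative assumptions.

```python
import hashlib
import uuid


def random_id() -> str:
    """Random ID: unique, but it must be stored to find the entry again."""
    return str(uuid.uuid4())


def snippet_id(file_path: str, symbol: str) -> str:
    """Deterministic ID built from a stable key (hypothetical key format)."""
    key = f"{file_path}::{symbol}"
    # Hash the key so the ID has a fixed length and no special characters
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:32]


# The same inputs always give the same ID; different inputs give different IDs
first = snippet_id("utils/io.py", "read_config")
again = snippet_id("utils/io.py", "read_config")
other = snippet_id("utils/io.py", "write_config")
print(first == again, first == other)
```

Either string could be passed as the ID to `collection.add`. With a deterministic ID, re-embedding a changed snippet could go through `collection.upsert` with the same ID, which sidesteps the lookup step as long as the key (path and name) stays stable.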