python, pandas, openai-api, tqdm, pinecone

Inserting data as vectors from SQL Database to Pinecone


I have a profiles table in SQL with around 50 columns and only 244 rows. I created a view with only two columns, ID and content; in content I concatenated the data from all the other columns in a format like this: FirstName: John. LastName: Smith. Age: 70. Likes: Gardening, Painting. Dislikes: Soccer.
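
For reference, the view is roughly of this shape; the column names (FirstName, LastName, Age, Likes, Dislikes) and the underlying dbo.Profiles table are just illustrative placeholders for the real ~50 columns:

    create_view_sql = """
    CREATE VIEW views.vwprofiles2 AS
    SELECT
        IdentityId,
        CONCAT(
            'FirstName: ', FirstName, '. ',
            'LastName: ', LastName, '. ',
            'Age: ', CAST(Age AS varchar(10)), '. ',
            'Likes: ', Likes, '. ',
            'Dislikes: ', Dislikes, '.'
        ) AS content
    FROM dbo.Profiles;
    """
    # executed once through the existing pyodbc connection, e.g.:
    # cnxn.cursor().execute(create_view_sql); cnxn.commit()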

Then I created the following code to index all the content from the view into Pinecone, and it works so far. However, I noticed something strange:

  1. There are already over 2,000 vectors and the upsert still isn't finished. The first iterations were really fast, but now each iteration takes over 18 seconds and the progress bar estimates over 40 minutes to finish upserting (for only 244 rows?).

What am I doing wrong, or is this normal?

    pinecone.init(
        api_key=PINECONE_API_KEY,  # find at app.pinecone.io
        environment=PINECONE_ENV  # next to api key in console
    )

    import streamlit as st
    st.title('Work in progress')
    embed = OpenAIEmbeddings(deployment=OPENAI_EMBEDDING_DEPLOYMENT_NAME, model=OPENAI_EMBEDDING_MODEL_NAME, chunk_size=1)
   
    cnxn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER='+DATABASE_SERVER+'.database.windows.net;DATABASE='+DATABASE_DB+';UID='+DATABASE_USERNAME+';PWD='+ DATABASE_PASSWORD)
    query = "SELECT * from views.vwprofiles2;"
    df = pd.read_sql(query, cnxn)
    index = pinecone.Index("default")
   
    batch_limit = 100

    texts = []
    metadatas = []

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=400,
        chunk_overlap=20,
        length_function=tiktoken_len,
        separators=["\n\n", "\n", " ", ""]
    )
    

    for _, record in stqdm(df.iterrows(), total=len(df)):
        # First get metadata fields for this record
        metadata = {
            'IdentityId': str(record['IdentityId'])
        }
        # Now we create chunks from the record text
        record_texts = text_splitter.split_text(record['content'])
        # Create individual metadata dicts for each chunk
        record_metadatas = [{
            "chunk": j, "text": text, **metadata
        } for j, text in enumerate(record_texts)]
        # Append these to the current batches
        texts.extend(record_texts)
        metadatas.extend(record_metadatas)
        # If we have reached the batch_limit, we can add texts
        if len(texts) >= batch_limit:
            ids = [str(uuid4()) for _ in range(len(texts))]
            embeds = embed.embed_documents(texts)
            index.upsert(vectors=zip(ids, embeds, metadatas))    
            texts = []
            metadatas = []

        if len(texts) > 0:
            ids = [str(uuid4()) for _ in range(len(texts))]
            embeds = embed.embed_documents(texts)
            index.upsert(vectors=zip(ids, embeds, metadatas))

Solution

  • I have done some research on this topic and have a few recommendations.

    Consider the following when optimizing code:

    • The specific hardware and software environment in which the code will be run.
    • The specific tasks that the code will be used for.
    • The level of performance that is required.

    With these factors in mind, it is possible to significantly improve the running time and efficiency of the code.

    In addition:

    • Use a variety of data structures and algorithms to find the best fit for your task.
    • Optimize your code for the specific hardware and software environment in which it will run.
    • Use a profiler to identify and fix performance bottlenecks (a minimal cProfile sketch follows this list).
    • Test your code thoroughly to ensure that it is correct and efficient.

    Do this and you should be able to improve the running time and efficiency of your code.
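
    As a starting point for the profiling bullet, here is a minimal sketch using Python's built-in cProfile; index_profiles is a hypothetical wrapper around the indexing loop from the question, not an existing function:

    import cProfile
    import pstats

    def index_profiles():
        # Hypothetical wrapper: the DataFrame loop with text splitting,
        # embedding, and upserting from the question would go here.
        pass

    # Run the indexing once under the profiler and show the 10 most expensive calls
    cProfile.run("index_profiles()", "indexing_stats")
    pstats.Stats("indexing_stats").sort_stats("cumulative").print_stats(10)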

    Example (random data generated with Faker, embedding and upserting simulated, each step timed):

    from uuid import uuid4

    import pandas as pd
    import time
    import tiktoken
    from faker import Faker
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    # tiktoken_len was not defined in the original snippet; this is the usual
    # token-count length function assumed here for the text splitter
    tokenizer = tiktoken.get_encoding("cl100k_base")

    def tiktoken_len(text):
        return len(tokenizer.encode(text, disallowed_special=()))

    # Initialize Faker for random data generation
    fake = Faker()
    
    # Create a DataFrame with 244 rows of random data
    data = {
        'IdentityId': [fake.uuid4() for _ in range(244)],
        'content': [fake.text(max_nb_chars=1000) for _ in range(244)]
    }
    df = pd.DataFrame(data)
    
    # Initialize lists for texts and metadata
    texts = []
    metadatas = []
    
    # Set the batch limit
    batch_limit = 500
    
    # Initialize the text splitter
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=800,
        chunk_overlap=20,
        length_function=tiktoken_len,
        separators=["\n\n", "\n", " ", ""]
    )
    
    # Iterate through DataFrame rows
    # Time Complexity: O(n), where n is the number of rows in the DataFrame
    for _, record in df.iterrows():
        start_time = time.time()
        
        # Get metadata for this record
        # Time Complexity: O(1)
        metadata = {
            'IdentityId': str(record['IdentityId'])
        }
        print(f'Time taken for metadata extraction: {time.time() - start_time} seconds')
    
        start_time = time.time()
        # Split record text into chunks
        # Time Complexity: O(m), where m is the size of the text
        record_texts = text_splitter.split_text(record['content'])
        print(f'Time taken for text splitting: {time.time() - start_time} seconds')
    
        start_time = time.time()
        # Create metadata for each chunk
        # Time Complexity: O(k), where k is the number of chunks in the text
        record_metadatas = [{
            "chunk": j, "text": text, **metadata
        } for j, text in enumerate(record_texts)]
        print(f'Time taken for metadata dictionary creation: {time.time() - start_time} seconds')
    
        start_time = time.time()
        # Append chunks and metadata to current batches
        # Time Complexity: O(k) amortized, where k is the number of chunks appended
        texts.extend(record_texts)
        metadatas.extend(record_metadatas)
        print(f'Time taken for data appending: {time.time() - start_time} seconds')
    
        # If batch_limit is reached, upsert vectors
        # Time Complexity: Depends on the upsert implementation
        if len(texts) >= batch_limit:
            start_time = time.time()
            ids = [str(uuid4()) for _ in range(len(texts))]
            # Simulating embedding and upserting here
            texts = []
            metadatas = []
            print(f'Time taken for vector upsertion (simulated): {time.time() - start_time} seconds')
    
    # Upsert any remaining vectors after the loop
    # Time Complexity: Depends on the upsert implementation
    if len(texts) > 0:
        start_time = time.time()
        ids = [str(uuid4()) for _ in range(len(texts))]
        # Simulating embedding and upserting here
        print(f'Time taken for remaining vector upsertion (simulated): {time.time() - start_time} seconds')
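
    If the timings point at the embedding and upsert calls themselves, the simulated sections can be swapped back to the real calls from the question, kept in batches of at most batch_limit vectors. This is only a sketch reusing the embed and index objects defined in the question:

    from uuid import uuid4

    def flush_batch(texts, metadatas, embed, index):
        # Embed the accumulated chunks in one call and upsert them as one batch
        ids = [str(uuid4()) for _ in range(len(texts))]
        embeds = embed.embed_documents(texts)
        index.upsert(vectors=list(zip(ids, embeds, metadatas)))

    # Inside the loop: flush and clear once the batch is full
    #     if len(texts) >= batch_limit:
    #         flush_batch(texts, metadatas, embed, index)
    #         texts, metadatas = [], []
    #
    # After the loop: flush whatever is left, exactly once
    #     if texts:
    #         flush_batch(texts, metadatas, embed, index)

    Flushing the leftover chunks once, after the loop, keeps each chunk from being embedded and upserted more than once.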