How to load an existing vector db into LangChain?

Tags: python, langchain, palm-api, hnswlib


I have the following code, which loads my PDF file, generates embeddings, and stores them in a vector db. I can then use it to perform searches.

The issue is that every time I run it, the embeddings are regenerated and stored in the db along with the ones already created.

I'm trying to figure out how to load an existing vector db into LangChain rather than recreating it every time the app runs.

load it

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import GooglePalmEmbeddings
from langchain.vectorstores import DocArrayHnswSearch


def load_embeddings(store, file):
    # delete the dir
    # shutil.rmtree(store)  # I have to delete it or it just loads double data

    loader = PyPDFLoader(file)
    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len,
        is_separator_regex=False,
    )
    pages = loader.load_and_split(text_splitter)

    return DocArrayHnswSearch.from_documents(
        pages, GooglePalmEmbeddings(), work_dir=store + "/", n_dim=768
    )

use it

db = load_embeddings("linda_store", "linda.pdf")
embeddings = GooglePalmEmbeddings()

query = "Have I worked with Oauth?"
embedding_vector = embeddings.embed_query(query)
docs = db.similarity_search_by_vector(embedding_vector)

for i in range(len(docs)):
    print(i, docs[i])

issue

This works fine, but if I run it again it just loads the file into the vector db a second time. I want it to use the existing db after I have created it, not create it again.

I can't seem to find a method for loading it. I tried:

db = DocArrayHnswSearch.load("hnswlib_store/", embeddings)

But that's a no-go.


Solution

  • Your load_embeddings function is recreating the database every time you call it. Here's why:

    1. You're loading from PyPDFLoader every time

    ...
    # We don't need this when loading from store
    loader = PyPDFLoader(file) 
    ...
    

    2. from_documents(documents, embedding, **kwargs)

    ...
    # We don't need to pass pages when loading from store
    return DocArrayHnswSearch.from_documents(
        pages, GooglePalmEmbeddings(), work_dir=store + "/", n_dim=768
    )
    ...
    

    Instead, you can try this:

    from langchain.embeddings import OpenAIEmbeddings
    from langchain.vectorstores import DocArrayHnswSearch

    def query_vector_store(query):
        # open_ai_key should hold your OpenAI API key
        embeddings = OpenAIEmbeddings(openai_api_key=open_ai_key)
        # from_params attaches to the existing index in "store/" instead of rebuilding it
        vector_store = DocArrayHnswSearch.from_params(embeddings, "store/", 1536)

        embedding_vector = embeddings.embed_query(query)

        return vector_store.similarity_search_by_vector(embedding_vector)
    

    I am using OpenAIEmbeddings() here, but the same code applies to GooglePalmEmbeddings(); just make sure you update the dimension value (768 instead of 1536, matching the n_dim you used when building the store).
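
    For example, adapted to the setup in your question (GooglePalmEmbeddings with n_dim=768 and the "linda_store" work directory), a rough sketch might look like this (query_palm_store is just an illustrative name, not a LangChain API):

    from langchain.embeddings import GooglePalmEmbeddings
    from langchain.vectorstores import DocArrayHnswSearch

    def query_palm_store(query):
        embeddings = GooglePalmEmbeddings()
        # 768 matches the n_dim used when the store was built in your original code;
        # from_params attaches to the existing index files in "linda_store/" without re-adding documents
        vector_store = DocArrayHnswSearch.from_params(embeddings, "linda_store/", 768)

        embedding_vector = embeddings.embed_query(query)
        return vector_store.similarity_search_by_vector(embedding_vector)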

    1. DocArrayHnswSearch.from_params

    We're using DocArrayHnswSearch.from_params here because it opens the embeddings already saved in the work directory; this method does not expect the documents.

    2. We're using our vector_store to perform similarity search

    As you can see from the query_vector_store(query) function above, we're not re-loading the documents with the PDF loader every time. Instead, we're just passing in our embeddings, work directory, and dimension.

    3. Usage

    You can use the method as such: query_vector_store('YOUR_QUERY').

    Based on your for loop here:

    for i in range(len(docs)):
        print(i, docs[i])
    

    You'll see the documents ordered from most to least similar.
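
    Putting it together, one way to avoid re-ingesting on every run is to build the store only when its work directory doesn't exist yet, and otherwise attach to it with from_params. A rough end-to-end sketch (get_vector_store is just an illustrative helper, and treating the directory's existence as proof the index was already built is an assumption):

    import os

    from langchain.document_loaders import PyPDFLoader
    from langchain.text_splitter import CharacterTextSplitter
    from langchain.embeddings import GooglePalmEmbeddings
    from langchain.vectorstores import DocArrayHnswSearch

    def get_vector_store(store, file, n_dim=768):
        embeddings = GooglePalmEmbeddings()
        if os.path.isdir(store):
            # Index files already exist: attach to them without re-adding documents
            return DocArrayHnswSearch.from_params(embeddings, store + "/", n_dim)

        # First run: split the PDF and build the index once
        loader = PyPDFLoader(file)
        splitter = CharacterTextSplitter(
            separator="\n", chunk_size=1000, chunk_overlap=200, length_function=len
        )
        pages = loader.load_and_split(splitter)
        return DocArrayHnswSearch.from_documents(
            pages, embeddings, work_dir=store + "/", n_dim=n_dim
        )

    db = get_vector_store("linda_store", "linda.pdf")
    docs = db.similarity_search("Have I worked with Oauth?")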

    I hope this helps!