Tags: python, langchain, embedding, large-language-model, llama-index

Integrating a LlamaIndex VectorStoreIndex with LangChain agents for RAG applications


I have been reading the documentation all day and can't wrap my head around how to create a VectorStoreIndex with llama_index and then use the resulting embeddings as supplemental information for a RAG application/chatbot that communicates with a user. I want to use llama_index because it has some nice implementations of more advanced retrieval techniques, like sentence window retrieval and auto-merging retrieval (to be fair, I have not investigated whether LangChain also supports these retrieval methods). I want to use LangChain because of its functionality for developing more complex prompt templates (similarly, I have not really investigated whether llama_index supports this).

My goal is ultimately to evaluate how these different retrieval methods perform within the context of the application/chatbot. I know how to evaluate them against a separate file of evaluation questions, but I would also like to compare things like response speed, the humanness of responses, and token usage.
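For context, the kind of comparison I have in mind is just wrapping each call with a timer and a token counter, something like the sketch below (the helper names are mine, and the `get_openai_callback` import path can differ between LangChain versions):

    import time

    from langchain_community.callbacks import get_openai_callback


    def timed_retrieval(query_engine, question):
        """Time a single llama_index retrieval/query call."""
        start = time.perf_counter()
        response = query_engine.query(question)
        return response, time.perf_counter() - start


    def timed_chat(chain, messages):
        """Time a LangChain chain call and capture OpenAI token usage."""
        start = time.perf_counter()
        with get_openai_callback() as cb:
            response = chain.invoke({"messages": messages})
        return response, time.perf_counter() - start, cb.total_tokens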

The code for a minimal reproducible example is as follows:

1) LangChain chatbot initiation

    from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
    from langchain.memory import ChatMessageHistory
    from langchain_openai import ChatOpenAI  # this import was missing from the original snippet


    prompt = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                """You are the world's greatest... \
                Use this document base to help you provide the best support possible to everyone you engage with.
                """,
            ),
            MessagesPlaceholder(variable_name="messages"),
        ]
    )

    # llm_model is defined elsewhere, e.g. "gpt-4"
    chat = ChatOpenAI(model=llm_model, temperature=0.7)

    chain = prompt | chat

    chat_history = ChatMessageHistory()

    while True:
        user_input = input("You: ")

        # Check for the exit command before spending a model call on it
        if user_input.lower() == "exit":
            break

        chat_history.add_user_message(user_input)
        response = chain.invoke({"messages": chat_history.messages})

        print("AI:", response.content)
        chat_history.add_ai_message(response)
2) LlamaIndex sentence window retrieval

    import os

    from llama_index.core import (
        ServiceContext,
        StorageContext,
        VectorStoreIndex,
        load_index_from_storage,
    )
    from llama_index.core.node_parser import SentenceWindowNodeParser
    from llama_index.core.indices.postprocessor import MetadataReplacementPostProcessor
    from llama_index.core.postprocessor import LLMRerank


    class SentenceWindowUtils:
        def __init__(self, documents, llm, embed_model, sentence_window_size):
            self.documents = documents
            self.llm = llm
            self.embed_model = embed_model
            self.sentence_window_size = sentence_window_size
            # self.save_dir = save_dir

            # Parse documents into single-sentence nodes that carry a window of
            # surrounding sentences in their metadata
            self.node_parser = SentenceWindowNodeParser.from_defaults(
                window_size=self.sentence_window_size,
                window_metadata_key="window",
                original_text_metadata_key="original_text",
            )

            self.sentence_context = ServiceContext.from_defaults(
                llm=self.llm,
                embed_model=self.embed_model,
                node_parser=self.node_parser,
            )

        def build_sentence_window_index(self, save_dir):
            # Build and persist the index on the first run, reload it afterwards
            if not os.path.exists(save_dir):
                os.makedirs(save_dir)
                sentence_index = VectorStoreIndex.from_documents(
                    self.documents, service_context=self.sentence_context
                )
                sentence_index.storage_context.persist(persist_dir=save_dir)
            else:
                sentence_index = load_index_from_storage(
                    StorageContext.from_defaults(persist_dir=save_dir),
                    service_context=self.sentence_context,
                )

            return sentence_index

        def get_sentence_window_query_engine(self, sentence_index, similarity_top_k=6, rerank_top_n=3):
            # Swap each retrieved sentence for its surrounding window, then rerank with the LLM
            postproc = MetadataReplacementPostProcessor(target_metadata_key="window")
            rerank = LLMRerank(top_n=rerank_top_n, service_context=self.sentence_context)

            sentence_window_engine = sentence_index.as_query_engine(
                similarity_top_k=similarity_top_k, node_postprocessors=[postproc, rerank]
            )

            return sentence_window_engine


    # documents, llm and embed_model are loaded/configured elsewhere
    sentence_window = SentenceWindowUtils(documents=documents, llm=llm, embed_model=embed_model, sentence_window_size=1)
    sentence_window_1 = sentence_window.build_sentence_window_index(save_dir='./indexes/sentence_window_index_1')
    sentence_window_engine_1 = sentence_window.get_sentence_window_query_engine(sentence_window_1)
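
As a sanity check, the query engine can be exercised on its own; the question string below is just a placeholder:

    response = sentence_window_engine_1.query("What does the document base say about <topic>?")
    print(response)  # synthesized answer

    # Each retrieved node's text has been replaced by its sentence window;
    # the original sentence is still available in the metadata
    for node_with_score in response.source_nodes:
        print(node_with_score.score)
        print(node_with_score.node.metadata.get("original_text"))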

Both blocks of code run independently. But the goal is that when a query warrants retrieval from the existing document base, I can use the sentence_window_engine that was built. I suppose I could retrieve relevant information based on the query and then pass that information into a subsequent prompt for the chatbot, but I would like to avoid including the document data directly in a prompt.
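
What I was imagining is something closer to exposing the query engine to LangChain as a tool, so the agent decides when to hit the document base rather than me pasting retrieved text into the prompt. Roughly along these lines (an untested sketch; the tool name and description are placeholders I made up, and `chat` is the ChatOpenAI instance from above):

    from langchain.agents import AgentType, initialize_agent
    from langchain.memory import ConversationBufferMemory
    from langchain.tools import Tool

    # Wrap the llama_index query engine so a LangChain agent can call it on demand
    doc_search_tool = Tool(
        name="document_base_search",  # placeholder name
        func=lambda q: str(sentence_window_engine_1.query(q)),
        description="Searches the internal document base and returns relevant passages.",
    )

    agent = initialize_agent(
        tools=[doc_search_tool],
        llm=chat,
        agent=AgentType.CHAT_CONVERSATIONAL_REACT_DESCRIPTION,
        memory=ConversationBufferMemory(memory_key="chat_history", return_messages=True),
        verbose=True,
    )

    # agent.invoke({"input": user_input}) would then retrieve only when the agent decides to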

Any suggestions?


Solution

  • I never found an exact way to retrieve the information via llama_index the way I had hoped, but I found a workaround by doing what I initially wanted to avoid: querying my document base and adding the result as context information for my chatbot, as follows:

    #### Conversation Prompt Chain ####
    from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
    from langchain_core.messages import AIMessage
    from langchain_openai import ChatOpenAI

    prompt = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                """You are the world's greatest...
                You have access to an extensive document base of information.
                Relevant information to the user query is provided below. Use the information at your own discretion if it improves the quality of the response.
                A summary of the previous conversation is also provided to contextualize you on the previous conversation.

                << Relevant Information >>
                {relevant_information}

                << Previous Conversation Summary >>
                {previous_conversation}

                << Current Prompt >>
                {user_input}
                """,
            ),
            MessagesPlaceholder(variable_name="messages"),
        ]
    )

    chat = ChatOpenAI(model=llm_model, temperature=0.0)

    chain = prompt | chat


    ### Application Start ###

    while True:
        # Some code....
        if route['destination'] == "data querying":
            # query_and_format_sql is a helper defined elsewhere that routes data questions to a SQL agent
            formatted_response = query_and_format_sql(username, password, host, port, mydatabase, query_prompt, model='gpt-4', client_name=client_name, user_input=user_input)
            print(formatted_response)
            chat_history.add_ai_message(AIMessage(f'The previous query triggered a SQL agent response that was {formatted_response}'))
        else:
            # Search the document base with the llama_index query engine
            RAG_Context = sentence_window_engine_1.query(user_input)

            # Inject the retrieved information into the chatbot's context
            context_with_relevant_info = {
                "user_input": user_input,
                "messages": chat_history.messages,
                "previous_conversation": memory.load_memory_variables({}),
                "relevant_information": RAG_Context,  # ==> Inject relevant information from llama_index here
            }

            response = chain.invoke(context_with_relevant_info)
    

    I haven't run into a token issue yet, but I can imagine that if my application grows and scales it may run into problems trying to inject the relevant information, the message history, and the prompt all at once. I limit my memory with a ConversationBufferMemoryHistory and that seems to work fine for now.
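
    For what it's worth, the memory-limiting idea looks roughly like this (a sketch using ConversationBufferWindowMemory, which keeps only the last k exchanges; k=5 is arbitrary):

        from langchain.memory import ConversationBufferWindowMemory

        # Keep only the last 5 exchanges so {previous_conversation} stays bounded
        memory = ConversationBufferWindowMemory(k=5, return_messages=True)

        # After each turn, store the exchange; load_memory_variables({}) then
        # returns at most the last 5 exchanges for the prompt
        memory.save_context({"input": user_input}, {"output": response.content})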