I am trying to extract information from a CSV file using LangChain and ChatGPT.
If I take just a few rows and use the 'stuff' chain type, it works perfectly. But when I use the whole CSV with "map_reduce", it fails on most of the questions.
My current code is the following:
import os
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv())  # read local .env file

from langchain.document_loaders import CSVLoader
from langchain.callbacks import get_openai_callback
from langchain.chains import RetrievalQA
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.vectorstores import Chroma

queries = ["Tell me the name of every driver who is German",
           "how many german drivers are?",
           "which driver uses the number 14?",
           "which driver has the oldest birthdate?"]

files = ['drivers.csv', 'drivers_full.csv']

for file in files:
    print("=====================================")
    print(file)
    print("=====================================")
    with get_openai_callback() as cb:
        loader = CSVLoader(file_path=file, encoding='utf-8')
        docs = loader.load()
        embeddings = OpenAIEmbeddings()
        # create the vectorstore to use as the index
        db = Chroma.from_documents(docs, embeddings)
        # expose this index in a retriever interface
        retriever = db.as_retriever(search_type="similarity",
                                    search_kwargs={"k": 1000, "score_threshold": 0.2})
        qa = RetrievalQA.from_chain_type(
            llm=OpenAI(temperature=0, batch_size=20),
            chain_type="map_reduce",
            retriever=retriever,
            verbose=True,
        )
        for query in queries:
            print(query)
            result = qa.run(query)
            print(result)
    print(cb)
It fails when answering how many German drivers there are, which driver uses number 14, and which driver has the oldest birthdate. The cost is also huge ($8!).
You have the code here: https://github.com/pablocastilla/langchain-embeddings/blob/main/langchain-embedding-full.ipynb
The way "map_reduce" works is that it first calls the LLM on each document (the "map" part), and then collects the answers from each call to produce a final answer (the "reduce" part). See LangChain's Map Reduce chain type.
LangChain's CSVLoader splits the CSV data source so that each row becomes a separate document. This means that if your CSV has 10,000 rows, it will call the OpenAI API 10,001 times (10,000 calls for the map step and 1 for the reduce step). Moreover, not all questions can be answered in a map-reduce fashion: questions such as "How many" or "What is the largest" require aggregating over the whole data set, which no single map call ever sees.
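To see why the bill explodes, the call count can be estimated with simple arithmetic. This is a back-of-the-envelope sketch (the helper names are mine, not LangChain API):

```python
def map_reduce_call_count(num_rows: int) -> int:
    """One LLM call per row-document (map) plus one final call (reduce)."""
    return num_rows + 1

def stuff_call_count(num_rows: int) -> int:
    """The "stuff" chain concatenates every retrieved document into a
    single prompt, so each question costs exactly one LLM call."""
    return 1

# A 10,000-row CSV loaded by CSVLoader becomes 10,000 documents:
print(map_reduce_call_count(10_000))  # -> 10001
print(stuff_call_count(10_000))       # -> 1
```

Multiply that by four questions and two files and the $8 bill is no surprise.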
I think you have to use the "stuff" chain type instead. "gpt-3.5-turbo-16k" is a good fit: it supports a 16K-token context window and is also much cheaper than the OpenAI model you chose.
Note that gpt-3.5-turbo-16k is a chat model, so you have to use ChatOpenAI instead of OpenAI.
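A minimal sketch of that setup, reusing the file names and LangChain version from your question (the k value is an illustrative assumption; tune it so the retrieved rows fit in the 16K context):

```python
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders import CSVLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

loader = CSVLoader(file_path="drivers_full.csv", encoding="utf-8")
docs = loader.load()

db = Chroma.from_documents(docs, OpenAIEmbeddings())
# Retrieve enough rows for aggregation questions, bounded by the context window.
retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 100})

qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name="gpt-3.5-turbo-16k", temperature=0),
    chain_type="stuff",  # one call per question: all docs go into one prompt
    retriever=retriever,
)

print(qa.run("how many german drivers are?"))
```

With "stuff" you pay one call per question instead of one per row, and the model sees all retrieved rows at once, so aggregation questions become answerable.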