I am trying to extract information from a CSV file using LangChain and ChatGPT.
If I take just a few rows and use the 'stuff' chain type, it works perfectly. But when I use the whole CSV with "map_reduce", it fails on most of the questions.
My current code is the following:
import os
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv())  # read local .env file

from langchain.document_loaders import CSVLoader
from langchain.callbacks import get_openai_callback
from langchain.chains import RetrievalQA
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.vectorstores import Chroma

queries = ["Tell me the name of every driver who is German",
           "how many german drivers are?",
           "which driver uses the number 14?",
           "which driver has the oldest birthdate?"]

files = ['drivers.csv', 'drivers_full.csv']

for file in files:
    print("=====================================")
    print(file)
    print("=====================================")
    with get_openai_callback() as cb:
        loader = CSVLoader(file_path=file, encoding='utf-8')
        docs = loader.load()
        embeddings = OpenAIEmbeddings()
        # create the vectorstore to use as the index
        db = Chroma.from_documents(docs, embeddings)
        # expose this index in a retriever interface
        retriever = db.as_retriever(search_type="similarity",
                                    search_kwargs={"k": 1000, "score_threshold": 0.2})
        qa = RetrievalQA.from_chain_type(
            llm=OpenAI(temperature=0, batch_size=20),
            chain_type="map_reduce",
            retriever=retriever,
            verbose=True,
        )
        for query in queries:
            print(query)
            result = qa.run(query)
            print(result)
    print(cb)
It fails when answering how many German drivers there are, which driver uses number 14, and which driver has the oldest birthdate. The cost is also huge ($8!).
You have the code here: https://github.com/pablocastilla/langchain-embeddings/blob/main/langchain-embedding-full.ipynb
The way "map_reduce" works is that it first calls the LLM on each document (the "map" part), and then collects the answers from each call to produce a final answer (the "reduce" part). See LangChain's Map Reduce chain type.
LangChain's CSVLoader splits the CSV data source so that each row becomes a separate document. This means that if your CSV has 10,000 rows, it will call the OpenAI API 10,001 times (10,000 calls for the map step and 1 for the reduce step). Moreover, not all questions can be answered in a map-reduce fashion: questions such as "How many" or "What is the largest" require aggregating over the whole data set, which no single map call ever sees.
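To see why the bill explodes, the call count can be estimated with simple arithmetic. This is a back-of-the-envelope sketch (the helper names are mine, not LangChain API):

```python
def map_reduce_call_count(num_rows: int) -> int:
    """One LLM call per row-document (map) plus one final call (reduce)."""
    return num_rows + 1

def stuff_call_count(num_rows: int) -> int:
    """The "stuff" chain concatenates every retrieved document into a
    single prompt, so each question costs exactly one LLM call."""
    return 1

# A 10,000-row CSV loaded by CSVLoader becomes 10,000 documents:
print(map_reduce_call_count(10_000))  # -> 10001
print(stuff_call_count(10_000))       # -> 1
```

Multiply that by four questions and two files and the $8 bill is no surprise.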
I think you have to use the "stuff" chain type instead. "gpt-3.5-turbo-16k" is a good fit: it supports a 16K-token context window and is also much cheaper than the OpenAI model you chose.
Note that gpt-3.5-turbo-16k is a chat model, so you have to use ChatOpenAI instead of OpenAI.
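A minimal sketch of that setup, reusing the file names and LangChain version from your question (the k value is an illustrative assumption; tune it so the retrieved rows fit in the 16K context):

```python
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders import CSVLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

loader = CSVLoader(file_path="drivers_full.csv", encoding="utf-8")
docs = loader.load()

db = Chroma.from_documents(docs, OpenAIEmbeddings())
# Retrieve enough rows for aggregation questions, bounded by the context window.
retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 100})

qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name="gpt-3.5-turbo-16k", temperature=0),
    chain_type="stuff",  # one call per question: all docs go into one prompt
    retriever=retriever,
)

print(qa.run("how many german drivers are?"))
```

With "stuff" you pay one call per question instead of one per row, and the model sees all retrieved rows at once, so aggregation questions become answerable.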