I am trying to get a LangChain application to query a document that contains different types of information. To make the application easier to work with, I want the response in a specific format, so I am using Pydantic to structure the data as I need, but I am running into an issue.
Sometimes ChatGPT does not respect the format defined by my Pydantic model, so a validation exception is raised and my program stops. Sure, I can handle the exception, but I would much rather have ChatGPT respect the format, and I wonder if I am doing something wrong.
More specifically, I do not know if I am using the chains correctly, because I keep getting confused by all the different examples in the LangChain documentation.
After loading all the necessary packages, this is the code I have:
FILE_PATH = 'foo.pdf'

class NameEnum(Enum):
    Name = 'Name'
    Surname = 'Surname'

class DocumentSchema(BaseModel):
    date: datetime.date = Field(..., description='The date of the doc')
    name: NameEnum = Field(..., description='Is it name or surname?')
def main():
    loader = PyPDFLoader(FILE_PATH)
    data = loader.load()

    text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=10)
    all_splits = text_splitter.split_documents(data)

    vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())
    llm = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0)

    question = """What is the date on the document?
Is it about a name or surname?
"""

    doc_prompt = PromptTemplate(
        template="Content: {page_content}\nSource: {source}",
        input_variables=["page_content", "source"],
    )

    prompt_messages = [
        SystemMessage(
            content=(
                "You are a world class algorithm for extracting information in structured formats."
            )
        ),
        HumanMessage(content="Answer the questions using the following context"),
        HumanMessagePromptTemplate.from_template("{context}"),
        HumanMessagePromptTemplate.from_template("Question: {question}"),
        HumanMessage(
            content="Tips: Make sure to answer in the correct format"
        ),
    ]
    chain_prompt = ChatPromptTemplate(messages=prompt_messages)

    chain = create_structured_output_chain(output_schema=DocumentSchema, llm=llm, prompt=chain_prompt)
    final_qa_chain_pydantic = StuffDocumentsChain(
        llm_chain=chain,
        document_variable_name="context",
        document_prompt=doc_prompt,
    )
    retrieval_qa_pydantic = RetrievalQA(
        retriever=vectorstore.as_retriever(), combine_documents_chain=final_qa_chain_pydantic
    )

    data = retrieval_qa_pydantic.run(question)
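For reference, a well-formed model answer for `DocumentSchema` would be JSON like the following; here it is checked with the standard library only, which mirrors the validation Pydantic performs on the two fields (the example payload is made up):

```python
import datetime
import json
from enum import Enum

class NameEnum(Enum):
    Name = 'Name'
    Surname = 'Surname'

# A well-formed answer for DocumentSchema: an ISO date and a known enum label.
raw = '{"date": "2023-07-14", "name": "Name"}'
payload = json.loads(raw)

date = datetime.date.fromisoformat(payload['date'])  # raises ValueError if malformed
name = NameEnum(payload['name'])                     # raises ValueError for unknown labels
```

If the model answers with a date in any other spelling, or with a label outside the enum, the equivalent Pydantic validation raises, which is exactly the exception described above.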
Depending on the file that I am checking, executing the script will raise an error because the response from ChatGPT does not respect the Pydantic schema.
What am I missing here?
Thank you!
I managed to solve my issues, and here is what I did to solve them.
First, I added a try/except block around the chain execution code to catch those naughty errors without stopping my execution.
I also noticed that the vectorstore variable was not getting "cleaned" between runs, so when I processed different documents in the same execution, old data would leak into the results for new documents. I realized that I needed to clean the vectorstore on each run:
try:
    # Retrieve the data
    data = retrieval_qa_pydantic.run(question)
    # Delete the embeddings for the next run
    vectorstore.delete()
except error_wrappers.ValidationError as e:
    log.error(f'Error parsing file: {e}')
else:
    return data
return None
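Instead of giving up after the first validation error, the same try/except can also be wrapped in a small retry loop, since a second call to the chain often produces a well-formed answer. This is a generic sketch: the helper name is mine, and `ValueError` stands in for Pydantic's `ValidationError` so the example is self-contained:

```python
from typing import Callable, Optional

def run_with_retries(run: Callable[[], object], max_attempts: int = 3) -> Optional[object]:
    """Re-invoke a chain call a few times before giving up on a malformed answer."""
    for _ in range(max_attempts):
        try:
            return run()
        except ValueError:  # stand-in for pydantic's ValidationError
            continue
    return None

# Usage sketch: run_with_retries(lambda: retrieval_qa_pydantic.run(question))
```

In my case a single try/except was enough, but the retry variant avoids discarding a whole document over one malformed response.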
Then, I noticed I needed to be more explicit with the data formatting. I modified the instructions to fit my requirements with extra help like this:
HumanMessage(
    content="Tips: Make sure to answer in the correct format. Dates should be in the format YYYY-MM-DD."
),
The key was the Tips part of the message. From that moment on, I had no more formatting problems regarding the date.
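To see why pinning the format to YYYY-MM-DD helps, note that an ISO date string parses cleanly into a `datetime.date`, while free-form spellings do not. This is a stdlib sketch of the check Pydantic runs on the `date` field; the helper name is mine:

```python
import datetime
from typing import Optional

def parse_doc_date(raw: str) -> Optional[datetime.date]:
    """Accept only ISO YYYY-MM-DD strings, as the schema's date field expects."""
    try:
        return datetime.date.fromisoformat(raw)
    except ValueError:
        return None

# An ISO-formatted answer validates...
parse_doc_date('2023-07-14')    # datetime.date(2023, 7, 14)
# ...while a free-form one is rejected.
parse_doc_date('July 14, 2023') # None
```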
To solve the issue with the Enum, I modified the class to account for a None value, meaning that when the LLM cannot find the info I need, it sets the variable to None. This is how I fixed it:
class NameEnum(Enum):
    Name = 'Name'
    Surname = 'Surname'
    NON = None
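With the extra member, a missing value can round-trip through the enum instead of raising. A quick check, independent of LangChain:

```python
from enum import Enum

class NameEnum(Enum):
    Name = 'Name'
    Surname = 'Surname'
    NON = None

# Known labels still resolve normally...
assert NameEnum('Surname') is NameEnum.Surname
# ...and a missing value now maps to NON instead of raising ValueError.
assert NameEnum(None) is NameEnum.NON
```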
Last but not least, I noticed that I was getting a lot of wrong information from my documents, so I had to tweak some extra things:
I increased the chunk size to 500 instead of 200, and to improve the accuracy of my task, I used gpt-4 as a model and not gpt-3.5-turbo anymore. By changing the size of the chunks and using gpt-4, I removed any inconsistencies, and the data extraction works almost flawlessly.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=10)
all_splits = text_splitter.split_documents(data)
vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())
llm = ChatOpenAI(model_name='gpt-4', temperature=0)
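For intuition about what `chunk_size` and `chunk_overlap` control, here is a simplified character-window splitter. This is not LangChain's actual algorithm (RecursiveCharacterTextSplitter also recurses over separators like newlines), just an illustration of the two parameters:

```python
import string

def split_text(text: str, chunk_size: int = 500, chunk_overlap: int = 10) -> list:
    """Slide a fixed-size window over the text; consecutive chunks share chunk_overlap chars."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = (string.ascii_lowercase * 50)[:1200]
chunks = split_text(text, chunk_size=500, chunk_overlap=10)
# 1200 chars with a 490-char step -> 3 chunks, each boundary sharing 10 chars.
```

Larger chunks mean each retrieved passage carries more surrounding context, which is why raising the size from 200 to 500 helped the extraction.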
I hope these tips help someone in the future.