Tags: python, python-3.x, openai-api, langchain, large-language-model

LangChain: Querying a document and getting structured output using Pydantic with ChatGPT not working well


I am trying to get a LangChain application to query a document that contains different types of information. To make the response easier to consume in my application, I want it in a specific format, so I am using Pydantic to structure the data as I need, but I am running into an issue.

Sometimes ChatGPT doesn't respect the format defined by my Pydantic schema, so an exception is raised and my program stops. Sure, I can handle the exception, but I would much rather have ChatGPT respect the format, and I wonder if I am doing something wrong.

More specifically:

  1. The date is not formatted correctly: ChatGPT returns the date from the document exactly as it found it, not as a datetime.date.
  2. The Enum field from Pydantic also doesn't work well: some documents say Lastname rather than Surname, and ChatGPT returns Lastname verbatim instead of mapping it to Surname.

Lastly, I do not know if I am using the chains correctly because I keep getting confused with all the different examples in the LangChain documentation.

After loading all the necessary packages, this is the code I have. (Imports are shown here for completeness; the paths assume the classic pre-0.1 langchain package layout.)

import datetime
from enum import Enum

from pydantic import BaseModel, Field
from langchain.chains import RetrievalQA, StuffDocumentsChain
from langchain.chains.openai_functions import create_structured_output_chain
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.prompts import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    PromptTemplate,
)
from langchain.schema import HumanMessage, SystemMessage
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

FILE_PATH = 'foo.pdf'

class NameEnum(Enum):
    Name = 'Name'
    Surname = 'Surname'

class DocumentSchema(BaseModel):
    date: datetime.date = Field(..., description='The date of the doc')
    name: NameEnum = Field(..., description='Is it name or surname?')

def main():
    loader = PyPDFLoader(FILE_PATH)
    data = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=10)
    all_splits = text_splitter.split_documents(data)
    vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())
    llm = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0)
    question = """What is the date on the document?
        Is it about a name or surname?
    """

    doc_prompt = PromptTemplate(
        template="Content: {page_content}\nSource: {source}",
        input_variables=["page_content", "source"],
    )
    prompt_messages = [
        SystemMessage(
            content=(
                "You are a world class algorithm for extracting information in structured formats."
            )
        ),
        HumanMessage(content="Answer the questions using the following context"),
        HumanMessagePromptTemplate.from_template("{context}"),
        HumanMessagePromptTemplate.from_template("Question: {question}"),
        HumanMessage(
            content="Tips: Make sure to answer in the correct format"
        ),
    ]

    chain_prompt = ChatPromptTemplate(messages=prompt_messages)

    chain = create_structured_output_chain(output_schema=DocumentSchema, llm=llm, prompt=chain_prompt)
    final_qa_chain_pydantic = StuffDocumentsChain(
        llm_chain=chain,
        document_variable_name="context",
        document_prompt=doc_prompt,
    )
    retrieval_qa_pydantic = RetrievalQA(
        retriever=vectorstore.as_retriever(), combine_documents_chain=final_qa_chain_pydantic
    )
    data = retrieval_qa_pydantic.run(question)

Depending on the file being checked, executing the script raises an error because ChatGPT's response does not respect the Pydantic schema.

What am I missing here?

Thank you!


Solution

  • I managed to solve my issues, and here is what I did.

    try/except block

    First, I added a try/except block around the chain execution code to catch those naughty errors without stopping my execution.

    Cleaning vectorstore

    I also noticed that the vectorstore variable was not getting "cleaned" between runs: when I processed several documents in the same execution, old embeddings bled into the results for new documents. I realized that I needed to clean the vectorstore on each run:

    try:
        # Retrieve the data
        data = retrieval_qa_pydantic.run(question)
        # Delete the embeddings for the next run
        vectorstore.delete()
    except error_wrappers.ValidationError as e:  # from pydantic import error_wrappers
        log.error(f'Error parsing file: {e}')
    else:
        return data
    return None
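
    The same guard pattern can be factored into a reusable helper. Here is a minimal stdlib-only sketch with a stand-in exception type; real code would substitute pydantic's ValidationError and pass retrieval_qa_pydantic.run as the callable:

```python
import logging

log = logging.getLogger(__name__)

def run_safely(fn, *args, expected_exc=ValueError, **kwargs):
    """Run fn and return its result, or None (after logging) if expected_exc is raised."""
    try:
        return fn(*args, **kwargs)
    except expected_exc as e:
        log.error('Error parsing file: %s', e)
        return None

# Stand-in usage: int() raising ValueError plays the role of the chain
# raising pydantic's ValidationError.
print(run_safely(int, '42'))            # 42
print(run_safely(int, 'not a number'))  # None
```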
    

    Tips about formatting

    Then, I noticed I needed to be more explicit with the data formatting. I modified the instructions to fit my requirements with extra help like this:

    HumanMessage(
            content="Tips: Make sure to answer in the correct format. Dates should be in the format YYYY-MM-DD."
        ),
    

    The key was the Tips part of the message. From that moment on, I had no more formatting problems regarding the date.
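
    If you also want a safety net on the Python side, a small stdlib-only sketch (no LangChain involved) shows why pinning the model to YYYY-MM-DD helps: ISO dates parse directly into datetime.date, while free-form dates each need their own strptime pattern. The fallback formats below are illustrative assumptions, not an exhaustive list:

```python
from datetime import date, datetime

def parse_llm_date(raw: str) -> date:
    """Parse a date string returned by the LLM into a datetime.date."""
    try:
        # Works whenever the LLM obeys the YYYY-MM-DD instruction.
        return date.fromisoformat(raw)
    except ValueError:
        pass
    # Free-form fallbacks, e.g. '31/12/2023' or 'December 31, 2023'.
    for fmt in ('%d/%m/%Y', '%B %d, %Y'):
        try:
            return datetime.strptime(raw, fmt).date()
        except ValueError:
            continue
    raise ValueError(f'Unrecognized date format: {raw!r}')

print(parse_llm_date('2023-12-31'))         # the format requested in the tip
print(parse_llm_date('December 31, 2023'))  # free-form fallback
```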

    Enum with None

    To solve the issue of the Enum, I modified the class to account for a None value, meaning when the LLM cannot find the info I need, it sets the variable to None. This is how I fixed it:

    class NameEnum(Enum):
        Name = 'Name'
        Surname = 'Surname'
        NON = None
    
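    As a complement, Python's Enum also offers a _missing_ hook that can map unexpected labels (like the Lastname case above) onto existing members instead of raising. A stdlib sketch, where the alias table is my own assumption about what the documents might contain:

```python
from enum import Enum

class NameEnum(Enum):
    Name = 'Name'
    Surname = 'Surname'
    NON = None  # fallback when the LLM finds nothing

    @classmethod
    def _missing_(cls, value):
        # Map synonyms the documents might use onto existing members;
        # anything still unknown falls back to NON instead of raising.
        aliases = {'Lastname': cls.Surname, 'Last name': cls.Surname}
        return aliases.get(value, cls.NON)

print(NameEnum('Lastname'))   # NameEnum.Surname
print(NameEnum('gibberish'))  # NameEnum.NON
```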

    Last but not least, I noticed that I was getting a lot of wrong information from my documents, so I had to tweak some extra things:

    Bigger splits and gpt-4

    I increased the chunk size to 500 instead of 200, and to improve the accuracy of my task I switched the model from gpt-3.5-turbo to gpt-4. With the larger chunks and gpt-4, the inconsistencies disappeared and the data extraction works almost flawlessly.

    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=10)
    all_splits = text_splitter.split_documents(data)
    vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())
    llm = ChatOpenAI(model_name='gpt-4', temperature=0)
    
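    To see why the chunk size mattered, here is a toy stdlib re-implementation of fixed-size character splitting with overlap (my own simplification; LangChain's RecursiveCharacterTextSplitter additionally prefers to split on separators). With a small window, a labeled field such as a date can be cut in half between two chunks; a larger window is far more likely to keep the label and its value together:

```python
def split_chars(text, chunk_size, overlap):
    """Naive fixed-size character splitter with overlap (illustration only)."""
    chunks, start = [], 0
    step = chunk_size - overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks

doc = 'Date of issue: 2023-12-31. ' * 20  # stand-in document text, 540 chars
small = split_chars(doc, chunk_size=50, overlap=10)
large = split_chars(doc, chunk_size=500, overlap=10)
# Small windows slice through "Date of issue: 2023-12-31" in several places;
# the large window keeps every occurrence of the field intact.
print(len(small), len(large))  # 14 2
```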

    I hope these tips help someone in the future.