Tags: openai-api, langchain, large-language-model, py-langchain

How to use Langchain text splitter to reduce tokens in my text


I am using Langchain with the OpenAI API to get summaries of PDF files. Some of my PDFs have many pages (more than the maximum number of tokens allowed in ChatGPT). I'm trying two approaches to reduce the tokens so that I can input longer texts, but it still does not work for a 300-page PDF.

  1. Retrieval augmented generation: more specifically, the text splitter

     text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)
     all_splits = text_splitter.split_documents(data)

  2. Text summarisation: using the stuff documents chain

     stuff_chain = StuffDocumentsChain(llm_chain=llm_chain, document_variable_name="text")
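To see what the first approach actually produces, it helps to look at how `chunk_size` and `chunk_overlap` interact. The snippet below is a simplified, plain-Python sketch of character-based splitting (it is not the real LangChain implementation, which also respects separators like paragraphs and newlines), but it shows why one long document turns into several overlapping chunks:

```python
def split_text(text, chunk_size=1000, chunk_overlap=50):
    """Simplified sketch of a character splitter: slide a window of
    chunk_size characters over the text, stepping by
    chunk_size - chunk_overlap so consecutive chunks share
    chunk_overlap characters of context."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "x" * 2500  # stand-in for the extracted text of a PDF
chunks = split_text(doc, chunk_size=1000, chunk_overlap=50)
print(len(chunks))     # → 3 chunks
print(len(chunks[0]))  # → 1000 characters, i.e. at most chunk_size
```

The overlap exists so that a sentence cut at a chunk boundary still appears whole in the neighbouring chunk; it slightly increases the total amount of text but preserves context.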

I would like to understand what the text splitter is doing, because it is not helping me input longer text in the prompt. How can I do this?


Solution

  • The text splitter breaks text down on separators such as new lines into chunks of the size you specify with chunk_size (measured in characters by default). Try printing your data before and after you split the documents so you can see how many documents were generated.

    The purpose of using a splitter is to break a document down into chunks, so that when you are doing retrieval you can get back the most relevant pieces of text rather than passing one large blob of text to the model (with which you would likely hit the token limit).

    It won't allow you to put MORE into the prompt, but it will allow you to fit the most relevant information within the token limit.

    There really is no way to fit very long documents into GPT, which is why people use text splitters, usually along with some sort of similarity search to retrieve the most relevant chunks (although there are other models with higher token limits).
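The "split, then retrieve the most relevant chunks" flow can be sketched with a toy scorer. Real pipelines use embeddings and a vector store rather than word overlap, and the chunk texts and query below are made up for illustration, but the shape is the same: score every chunk against the question, keep only the top-k, and send just those to the model:

```python
def top_k_chunks(chunks, query, k=2):
    """Toy similarity search: score each chunk by how many words it
    shares with the query and return the k best-scoring chunks."""
    query_words = set(query.lower().split())
    scored = sorted(
        chunks,
        key=lambda c: len(query_words & set(c.lower().split())),
        reverse=True,
    )
    return scored[:k]

# Hypothetical chunks produced by the text splitter
chunks = [
    "Revenue grew 12 percent in the final quarter.",
    "The appendix lists all office locations.",
    "Quarterly revenue figures are shown in table 3.",
]
relevant = top_k_chunks(chunks, "quarterly revenue report", k=2)
# Only the revenue-related chunks go into the prompt, which is how
# retrieval keeps a 300-page document under the model's token limit.
```

For summarising a whole long document (rather than answering a question about it), a map-reduce style summarisation chain, which summarises each chunk and then summarises the summaries, is the usual alternative to stuffing everything into one prompt.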