Tags: document, openai-api, langchain

How does LangChain help to overcome the limited context size of ChatGPT?


It's not possible to pass long documents to ChatGPT directly because of its limited context size, so at first sight tasks like question answering or summarization of long documents seem impossible. I've learned how ChatGPT can in principle "know" larger contexts -- basically by summarizing a sequence of previous contexts from the chat history -- but will this suffice to detect really long-range dependencies (ones that carry "meaning") inside really long texts?

LangChain seems to offer a solution, making use of OpenAI's API and vector stores. I'm looking for a high-level description of what's going on when LangChain makes long documents, or even corpora of long documents, accessible to ChatGPT and then exploits ChatGPT's NLP abilities through clever automated prompting, e.g. for question answering or summarization. Let's assume that the documents are already formatted as LangChain Document objects.


Solution

  • So you have your document(s) containing a lot of text. LangChain can load these documents and extract their text, as in the sketch below. Since the full text is too big to fit into the context, it needs to be split into several chunks.
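
    One of LangChain's document loaders can do this; here is a minimal sketch using TextLoader (the file name is a placeholder, and you can skip this step if you already have Document objects):

    ```python
    from langchain.document_loaders import TextLoader

    # Load a plain-text file into a list of LangChain Document objects
    docs = TextLoader("my_long_document.txt").load()
    ```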

    This can for example be done with LangChain's RecursiveCharacterTextSplitter. You'll have to choose a chunk size so that each chunk, again, isn't too large for the context size. By setting a chunk overlap you can preserve context across the chunk boundaries.
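
    A minimal sketch (the chunk size and overlap values are illustrative, not recommendations):

    ```python
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,    # maximum characters per chunk
        chunk_overlap=100,  # overlap preserves context across chunk boundaries
    )
    chunks = splitter.split_documents(docs)
    ```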

    After you have your desired chunks, you'll need to create an embedding for each of them, for example by using OpenAIEmbeddings. These embeddings are used to find the chunks of text most relevant to the query. This means that not the entire content of the documents is used as context, but only the most relevant parts. The chunks and their embeddings are stored in a vector database.
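
    A minimal sketch, assuming FAISS as the vector database (any LangChain vector store, e.g. Chroma or Pinecone, works the same way; the query string is a placeholder, and FAISS requires the faiss-cpu package):

    ```python
    from langchain.embeddings import OpenAIEmbeddings
    from langchain.vectorstores import FAISS

    # Embed every chunk and store chunks plus embeddings in the vector database
    db = FAISS.from_documents(chunks, OpenAIEmbeddings())

    # At query time, retrieve the k chunks whose embeddings are closest to the query
    relevant_chunks = db.similarity_search("What does the author conclude?", k=4)
    ```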

    Once the relevant chunks are found, they can be passed into a prompt template. For example, the default prompt template in LangChain's RetrievalQA looks like this:

    Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

    {context}

    Question: {question}
    Helpful Answer:


    The filled-in prompt is then passed to the LLM, which generates the answer.
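
    Putting it all together with RetrievalQA, a minimal end-to-end sketch (it assumes your OpenAI API key is set in the environment and reuses the db from the step above; the query is a placeholder):

    ```python
    from langchain.chains import RetrievalQA
    from langchain.chat_models import ChatOpenAI

    qa = RetrievalQA.from_chain_type(
        llm=ChatOpenAI(temperature=0),
        chain_type="stuff",           # stuff the retrieved chunks directly into the prompt shown above
        retriever=db.as_retriever(),  # wraps the vector database from the previous step
    )
    answer = qa.run("What does the author conclude?")
    ```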