Search code examples
pythonpdfopenai-apichat-gpt-4

How can I process a pdf using OpenAI's APIs (GPTs)?


The web interface for ChatGPT has an easy pdf upload. Is there an API from openAI that can receive pdfs?

I know there are 3rd party libraries that can read pdf but given there are images and other important information in a pdf, it might be better if a model like GPT 4 Turbo was fed the actual pdf directly.

I'll state my use case to add more context. I intent to do RAG. In the code below I handle the PDF and a prompt. Normally I'd append the text at the end of the prompt. I could still do that with a pdf if I extract its contents manually.

The following code is taken from here https://platform.openai.com/docs/assistants/tools/code-interpreter. Is this how I'm supposed to do it?

# Upload a file with an "assistants" purpose
file = client.files.create(
  file=open("example.pdf", "rb"),
  purpose='assistants'
)

# Create an assistant using the file ID
assistant = client.beta.assistants.create(
  instructions="You are a personal math tutor. When asked a math question, write and run code to answer the question.",
  model="gpt-4-1106-preview",
  tools=[{"type": "code_interpreter"}],
  file_ids=[file.id]
)

There is an upload endpoint as well, but it seems the intent of those endpoints are for fine-tuning and assistants. I think the RAG use case is a normal one and not necessarily related to assistants.


Solution

  • As of today (openai.__version__==1.42.0) using OpenAI Assistants + GPT-4o allows to extract content of (or answer questions on) an input pdf file foobar.pdf stored locally, with a solution along the lines of

    from openai import OpenAI
    from openai.types.beta.threads.message_create_params import (
        Attachment,
        AttachmentToolFileSearch,
    )
    import os
    
    filename = "foobar.pdf"
    prompt = "Extract the content from the file provided without altering it. Just output its exact content and nothing else."
    
    client = OpenAI(api_key=os.environ.get("MY_OPENAI_KEY"))
    
    pdf_assistant = client.beta.assistants.create(
        model="gpt-4o",
        description="An assistant to extract the contents of PDF files.",
        tools=[{"type": "file_search"}],
        name="PDF assistant",
    )
    
    # Create thread
    thread = client.beta.threads.create()
    
    file = client.files.create(file=open(filename, "rb"), purpose="assistants")
    
    # Create assistant
    client.beta.threads.messages.create(
        thread_id=thread.id,
        role="user",
        attachments=[
            Attachment(
                file_id=file.id, tools=[AttachmentToolFileSearch(type="file_search")]
            )
        ],
        content=prompt,
    )
    
    # Run thread
    run = client.beta.threads.runs.create_and_poll(
        thread_id=thread.id, assistant_id=pdf_assistant.id, timeout=1000
    )
    
    if run.status != "completed":
        raise Exception("Run failed:", run.status)
    
    messages_cursor = client.beta.threads.messages.list(thread_id=thread.id)
    messages = [message for message in messages_cursor]
    
    # Output text
    res_txt = messages[0].content[0].text.value
    print(res_txt)
    

    The prompt can of course be replaced with the desired user request and I assume that the openai key is stored in a env var named MY_OPENAI_KEY.

    Limitations: