Tags: openai-api, langchain, chatgpt-api, pinecone

OpenAI API: Will the data I send with API requests remain private?


I have created a Q&A bot using the OpenAI Embeddings API, Pinecone as a vector database, and an OpenAI model as the LLM. I am using LangChain and the gpt-3.5-turbo model, with my own dataset of PDF files against which questions are answered.

The solution is working properly. So far I have only used test PDF files, but I want to use my private PDF files. Does my data remain private in this architecture?
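To see exactly what leaves your machine in this architecture, it helps to look at the request bodies involved. The sketch below is illustrative only: it builds the payloads without sending anything, and the embedding model name (`text-embedding-ada-002`) and sample text are assumptions, not taken from your setup. The point is that your document text appears verbatim in the embeddings request, in the Pinecone metadata, and again in the chat prompt.

```python
# Illustrative sketch of the data that leaves your machine in this
# architecture. No real requests are sent; payloads are built as dicts.

def embeddings_request(chunk: str) -> dict:
    # Body of a POST to https://api.openai.com/v1/embeddings:
    # the raw document text is included verbatim.
    return {"model": "text-embedding-ada-002", "input": chunk}

def pinecone_upsert(vector: list, chunk: str) -> dict:
    # Pinecone upserts commonly store the source text as metadata,
    # so the document content is also held by Pinecone.
    return {
        "vectors": [
            {"id": "chunk-1", "values": vector, "metadata": {"text": chunk}}
        ]
    }

def chat_request(question: str, context: str) -> dict:
    # Body of a POST to https://api.openai.com/v1/chat/completions:
    # the retrieved document text travels again, inside the prompt.
    return {
        "model": "gpt-3.5-turbo",
        "messages": [
            {"role": "system", "content": f"Answer using this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    }
```

Every one of these payloads crosses the network to a third party, which is what the privacy question below is really about.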

Does OpenAI index my data in public space, or will it remain private to me?


Solution

  • The answer is not simple.

    As of today, OpenAI doesn't train models on inputs and outputs submitted through the API, as stated in the official OpenAI documentation:

    [Screenshot of the OpenAI documentation stating that API data is not used for training]

    But, technically speaking, once you make a request to the OpenAI API, you send data outside your own infrastructure. This is a major concern for many companies and even individuals. OpenAI tries to address these concerns with the commitments stated on the official OpenAI website:

    Ownership: You own and control your data

    • We do not train on your data from ChatGPT Enterprise or our API Platform
    • You own your inputs and outputs (where allowed by law)
    • You control how long your data is retained (ChatGPT Enterprise)

    Control: You decide who has access

    • Enterprise-level authentication through SAML SSO
    • Fine-grained control over access and available features
    • Custom models are yours alone to use, they are not shared with anyone else

    Security: Comprehensive compliance

    • We’ve been audited for SOC 2 compliance
    • Data encryption at rest (AES-256) and in transit (TLS 1.2+)
    • Visit our Trust Portal to understand more about our security measures

    It's up to you to decide whether these commitments are enough for you to feel comfortable sending (possibly) sensitive data to the OpenAI API. If they are, use the OpenAI API; otherwise, run a local LLM.
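If you go the local route, one common option is serving a model with Ollama, which exposes an HTTP API on localhost. The sketch below assumes Ollama is running at `localhost:11434` with a model (here `llama2`, an assumption) already pulled; it builds the same kind of prompt you would otherwise have sent to OpenAI, but the request never leaves your machine.

```python
# Hedged sketch: querying a locally hosted model through Ollama's
# /api/generate endpoint. Model name "llama2" is an assumption.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # local only


def build_payload(question: str, context: str, model: str = "llama2") -> dict:
    """Assemble the same RAG-style prompt you would have sent to OpenAI."""
    return {
        "model": model,
        "prompt": f"Answer using this context:\n{context}\n\nQuestion: {question}",
        "stream": False,  # return one JSON object instead of a stream
    }


def ask_local(question: str, context: str) -> str:
    """Send the prompt to the local Ollama server and return its answer."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(question, context)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

The trade-off is operational: you take on hosting, hardware, and model quality yourself, but your documents stay on infrastructure you control.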