Tags: openai-api, langchain, chatgpt-api, pinecone

OpenAI API: Will the data I send with API requests remain private?


I have created a Q&A bot using the OpenAI Embeddings API, Pinecone as a vector database, and an OpenAI model as the LLM. I am using LangChain and the gpt-3.5-turbo model, with my own dataset of PDF files against which questions are answered.

The solution is working properly. So far I have only used test PDF files, but I want to use my private PDF files. Does my data remain private in this architecture?
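To see exactly what leaves your machine in this architecture, it helps to look at the request bodies involved. The sketch below is illustrative only: it builds the payloads without sending anything, and the embedding model name (`text-embedding-ada-002`) and sample text are assumptions, not taken from your setup. The point is that your document text appears verbatim in the embeddings request, in the Pinecone metadata, and again in the chat prompt.

```python
# Illustrative sketch of the data that leaves your machine in this
# architecture. No real requests are sent; payloads are built as dicts.

def embeddings_request(chunk: str) -> dict:
    # Body of a POST to https://api.openai.com/v1/embeddings:
    # the raw document text is included verbatim.
    return {"model": "text-embedding-ada-002", "input": chunk}

def pinecone_upsert(vector: list, chunk: str) -> dict:
    # Pinecone upserts commonly store the source text as metadata,
    # so the document content is also held by Pinecone.
    return {
        "vectors": [
            {"id": "chunk-1", "values": vector, "metadata": {"text": chunk}}
        ]
    }

def chat_request(question: str, context: str) -> dict:
    # Body of a POST to https://api.openai.com/v1/chat/completions:
    # the retrieved document text travels again, inside the prompt.
    return {
        "model": "gpt-3.5-turbo",
        "messages": [
            {"role": "system", "content": f"Answer using this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    }
```

Every one of these payloads crosses the network to a third party, which is what the privacy question below is really about.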

Does OpenAI index my data in public space, or will it remain private to me?


Solution

  • The answer is not simple.

    As of today, OpenAI doesn't train models on inputs and outputs submitted through the API, as stated in the official OpenAI documentation:

    [Screenshot of the OpenAI documentation stating that API data is not used for training]

    But, technically speaking, once you make a request to the OpenAI API, you send data outside your own infrastructure. This is a major concern for many companies and even individuals. OpenAI tries to address these concerns with the commitments stated on the official OpenAI website:

    Ownership: You own and control your data

    • We do not train on your data from ChatGPT Enterprise or our API Platform
    • You own your inputs and outputs (where allowed by law)
    • You control how long your data is retained (ChatGPT Enterprise)

    Control: You decide who has access

    • Enterprise-level authentication through SAML SSO
    • Fine-grained control over access and available features
    • Custom models are yours alone to use, they are not shared with anyone else

    Security: Comprehensive compliance

    • We’ve been audited for SOC 2 compliance
    • Data encryption at rest (AES-256) and in transit (TLS 1.2+)
    • Visit our Trust Portal to understand more about our security measures

    It's up to you to decide whether these commitments are enough for you to feel comfortable sending (possibly) sensitive data to the OpenAI API. If they are, use the OpenAI API; otherwise, run a local LLM.
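If you go the local route, one common option is serving a model with Ollama, which exposes an HTTP API on localhost. The sketch below assumes Ollama is running at `localhost:11434` with a model (here `llama2`, an assumption) already pulled; it builds the same kind of prompt you would otherwise have sent to OpenAI, but the request never leaves your machine.

```python
# Hedged sketch: querying a locally hosted model through Ollama's
# /api/generate endpoint. Model name "llama2" is an assumption.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # local only


def build_payload(question: str, context: str, model: str = "llama2") -> dict:
    """Assemble the same RAG-style prompt you would have sent to OpenAI."""
    return {
        "model": model,
        "prompt": f"Answer using this context:\n{context}\n\nQuestion: {question}",
        "stream": False,  # return one JSON object instead of a stream
    }


def ask_local(question: str, context: str) -> str:
    """Send the prompt to the local Ollama server and return its answer."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(question, context)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

The trade-off is operational: you take on hosting, hardware, and model quality yourself, but your documents stay on infrastructure you control.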