python chatgpt-api py-langchain

LangChain Python - ability to abstract chunks of confidential text before submitting to an LLM


Suppose an organization wants to use an LLM (e.g. OpenAI's GPT-4) on confidential documents but, as a precaution, would like confidential information to be abstracted automatically first. Is this possible using the LangChain API without losing much of the context? For example, a company name would simply be replaced with "Company A". I am looking for options that are available as a generic method, like embeddings, which understand the semantic meaning of words.


Solution

  • It looks like you need a redaction step before sending the data to ChatGPT. AWS and Azure both offer APIs that do PII redaction:

    https://aws.amazon.com/blogs/machine-learning/detecting-and-redacting-pii-using-amazon-comprehend/

    https://learn.microsoft.com/en-us/azure/ai-services/language-service/personally-identifiable-information/how-to-call

    To redact information that isn't PII, there are NER (Named Entity Recognition) models and services available, for example:
    https://huggingface.co/dslim/bert-base-NER