In today's world, data privacy is very important, especially when working with large language models (LLMs) and sensitive information. Companies and individuals often need to use private data, such as personal identifiable information (PII), into their LLM applications. However, the risk of data leaks and privacy breaches is a constant threat, making it necessary to implement data protection measures.
In this blog post, we'll explore a solution for protecting private data when building question-answering systems using LLMs. We'll dive into the concept of data anonymization, which involves replacing sensitive information with synthetic data or placeholders before feeding it to the LLM. By using LangChain and the Presidio library from Microsoft, we can create a secure and customizable anonymization pipeline for our LLM applications.
What is PII and Why Protect It?
Personally identifiable information (PII) refers to any data that can be used to identify, contact, or locate an individual. PII can be categorized into two types:
Linked Information: This includes direct identifiers such as email addresses, phone numbers, Social Security numbers, and passport numbers.
Linkable Information: This refers to indirect information that can be combined with other data to identify an individual. For example, an age, occupation, and location combination might uniquely identify someone.
It's crucial to protect PII from potential data breaches or unauthorized access, as the consequences can be severe, including identity theft, financial fraud, and legal implications.
Protecting Data from What?
While using external APIs like OpenAI's or Anthropic, our data may be at risk of being leaked or stored for a certain period (e.g., 30 days). Even if we host our own LLM instance, there is still a risk of data leaks or model's memorization of sensitive information during training.
To avoid these risks, we have two main options:
Host our own LLM: This allows us to keep the data on-premises, but it can be costly, and the available models may not match the performance of GPT-4o or other state-of-the-art LLMs.
Anonymize data before feeding it to the LLM: By replacing sensitive information with placeholders or synthetic data, we can protect our private data while still usingof external LLMs or APIs.
In this blog post, we'll focus on the second option: data anonymization using LangChain and Presidio.
Introducing Presidio
Presidio is an open-source library from Microsoft that provides robust and customizable tools for anonymizing text data. It consists of two main components:
Analyzer: This component recognizes and identifies PII entities in the text using built-in patterns, regular expressions, and named entity recognition models.
Anonymizer: This component replaces the identified PII entities with placeholders, markers, or synthetic data.
Presidio offers high customizability, allowing us to add custom recognizers and operators to handle specific data formats or requirements.
Anonymization with LangChain and Presidio
Let's dive into the code and explore how we can integrate Presidio with LangChain to create a secure question-answering system with data anonymization.
1. Initialize the Anonymizer
First, we need to initialize the PresidioReversibleAnonymizer
from LangChain, which provides the ability to restore the original data after anonymization.
from langchain_experimental.data_anonymizer import PresidioReversibleAnonymizer
import re
anonymizer = PresidioReversibleAnonymizer(
add_default_faker_operators=False,
)
2. Anonymize the Data
Now, we can anonymize the text by replacing the identified PII entities with placeholders or markers:
from langchain_core.documents import Document
def print_colored_pii(string):
colored_string = re.sub(
r"(<[^>]*>)", lambda m: "\033[31m" + m.group(1) + "\033[0m", string
)
print(colored_string)
document_content = """Date: October 19, 2021
Witness: John Doe
Subject: Testimony Regarding the Loss of Wallet
Testimony Content:
Hello Officer,
My name is John Doe and on October 19, 2021, my wallet was stolen in the vicinity of Kilmarnock during a bike trip. This wallet contains some very important things to me.
Firstly, the wallet contains my credit card with number 4111 1111 1111 1111, which is registered under my name and linked to my bank account, PL61109010140000071219812874.
Additionally, the wallet had a driver's license - DL No: 999000680 issued to my name. It also houses my Social Security Number, 602-76-4532.
What's more, I had my polish identity card there, with the number ABC123456.
I would like this data to be secured and protected in all possible ways. I believe It was stolen at 9:30 AM.
In case any information arises regarding my wallet, please reach out to me on my phone number, 999-888-7777, or through my personal email, johndoe@example.com.
Please consider this information to be highly confidential and respect my privacy.
The bank has been informed about the stolen credit card and necessary actions have been taken from their end. They will be reachable at their official email, support@bankname.com.
My representative there is Victoria Cherry (her business phone: 987-654-3210).
Thank you for your assistance,
John Doe"""
documents = [Document(page_content=document_content)]
anonymized_text = anonymizer.anonymize(document_content)
print_colored_pii(anonymized_text)
Date: <DATE_TIME>
Witness: <PERSON>
Subject: Testimony Regarding the Loss of Wallet
Testimony Content:
Hello Officer,
My name is <PERSON> and on <DATE_TIME>, my wallet was stolen in the vicinity of <LOCATION> during a bike trip. This wallet contains some very important things to me.
Firstly, the wallet contains my credit card with number <CREDIT_CARD>, which is registered under my name and linked to my bank account, <IBAN_CODE>.
Additionally, the wallet had a driver's license - DL No: <US_DRIVER_LICENSE> issued to my name. It also houses my Social Security Number, <US_SSN>.
What's more, I had my polish identity card there, with the number ABC123456.
I would like this data to be secured and protected in all possible ways. I believe It was stolen at <DATE_TIME_2>.
In case any information arises regarding my wallet, please reach out to me on my phone number, <PHONE_NUMBER>, or through my personal email, <EMAIL_ADDRESS>.
Please consider this information to be highly confidential
This will output the anonymized text with markers like <PERSON>
, <DATE_TIME>
, <CREDIT_CARD>
, etc., replacing the sensitive information.
from pprint import pprint
pprint(anonymizer.deanonymizer_mapping)
{'CREDIT_CARD': {'<CREDIT_CARD>': '4111 1111 1111 1111'},
'DATE_TIME': {'<DATE_TIME>': 'October 19, 2021', '<DATE_TIME_2>': '9:30 AM'},
'EMAIL_ADDRESS': {'<EMAIL_ADDRESS>': 'johndoe@example.com',
'<EMAIL_ADDRESS_2>': 'support@bankname.com'},
'IBAN_CODE': {'<IBAN_CODE>': 'PL61109010140000071219812874'},
'LOCATION': {'<LOCATION>': 'Kilmarnock'},
'PERSON': {'<PERSON>': 'John Doe', '<PERSON_2>': 'Victoria Cherry'},
'PHONE_NUMBER': {'<PHONE_NUMBER>': '999-888-7777'},
'UK_NHS': {'<UK_NHS>': '987-654-3210'},
'US_DRIVER_LICENSE': {'<US_DRIVER_LICENSE>': '999000680'},
'US_SSN': {'<US_SSN>': '602-76-4532'}}
In general, the anonymizer works pretty well, but I can observe two things to improve here:
Datetime redundancy - we have two different entities recognized as
DATE_TIME
, but they contain different type of information. The first one is a date (October 19, 2021), the second one is a time (9:30 AM). We can improve this by adding a new recognizer to the anonymizer, which will treat time separately from the date.Polish ID - polish ID has unique pattern, which is not by default part of anonymizer recognizers. The value ABC123456 is not anonymized.
The solution is simple: we need to add a new recognizers to the anonymizer.
3. Add Custom Recognizers
Since Presidio may not recognize certain PII entities by default, we can add custom recognizers to handle specific data formats. For example, let's create recognizers for Polish ID numbers and time formats:
from presidio_analyzer import Pattern, PatternRecognizer
polish_id_pattern = Pattern(
name="polish_id_pattern",
regex="[A-Z]{3}\d{6}",
score=1,
)
time_pattern = Pattern(
name="time_pattern",
regex="(1[0-2]|0?[1-9]):[0-5][0-9] (AM|PM)",
score=1,
)
polish_id_recognizer = PatternRecognizer(
supported_entity="POLISH_ID", patterns=[polish_id_pattern]
)
time_recognizer = PatternRecognizer(supported_entity="TIME", patterns=[time_pattern])
anonymizer.add_recognizer(polish_id_recognizer)
anonymizer.add_recognizer(time_recognizer)
# Note that our anonymization instance remembers previously detected and anonymized values,
# including those that were not detected correctly (e.g., "9:30 AM" taken as DATE_TIME). So it's worth removing this value, or resetting the entire mapping now that our recognizers have been updated:
anonymizer.reset_deanonymizer_mapping()
print_colored_pii(anonymizer.anonymize(document_content))
Date: <DATE_TIME>
Witness: <PERSON>
Subject: Testimony Regarding the Loss of Wallet
Testimony Content:
Hello Officer,
My name is <PERSON> and on <DATE_TIME>, my wallet was stolen in the vicinity of <LOCATION> during a bike trip. This wallet contains some very important things to me.
Firstly, the wallet contains my credit card with number <CREDIT_CARD>, which is registered under my name and linked to my bank account, <IBAN_CODE>.
Additionally, the wallet had a driver's license - DL No: <US_DRIVER_LICENSE> issued to my name. It also houses my Social Security Number, <US_SSN>.
What's more, I had my polish identity card there, with the number <POLISH_ID>.
I would like this data to be secured and protected in all possible ways. I believe It was stolen at <TIME>.
In case any information arises regarding my wallet, please reach out to me on my phone number, <PHONE_NUMBER>, or through my personal email, <EMAIL_ADDRESS>.
Please consider this information to be highly confidential and respect my privacy.
The bank has been informed about the stolen credit card and necessary actions have been taken from their end. They will be reachable at their official email, <EMAIL_ADDRESS_2>.
My representative there is <PERSON_2> (her business phone: <UK_NHS>).
Thank you for your assistance,
<PERSON>
pprint(anonymizer.deanonymizer_mapping)
{'CREDIT_CARD': {'<CREDIT_CARD>': '4111 1111 1111 1111'},
'DATE_TIME': {'<DATE_TIME>': 'October 19, 2021'},
'EMAIL_ADDRESS': {'<EMAIL_ADDRESS>': 'johndoe@example.com',
'<EMAIL_ADDRESS_2>': 'support@bankname.com'},
'IBAN_CODE': {'<IBAN_CODE>': 'PL61109010140000071219812874'},
'LOCATION': {'<LOCATION>': 'Kilmarnock'},
'PERSON': {'<PERSON>': 'John Doe', '<PERSON_2>': 'Victoria Cherry'},
'PHONE_NUMBER': {'<PHONE_NUMBER>': '999-888-7777'},
'POLISH_ID': {'<POLISH_ID>': 'ABC123456'},
'TIME': {'<TIME>': '9:30 AM'},
'UK_NHS': {'<UK_NHS>': '987-654-3210'},
'US_DRIVER_LICENSE': {'<US_DRIVER_LICENSE>': '999000680'},
'US_SSN': {'<US_SSN>': '602-76-4532'}}
As you can see, our new recognizers work as expected. The anonymizer has replaced the time and Polish ID entities with the <TIME>
and <POLISH_ID>
markers, and the deanonymizer mapping has been updated accordingly.
4. Synthesizing Original Values
Now, when all PII values are detected correctly, we can proceed to the next step, which is replacing the original values with synthetic ones. To do this, we need to set add_default_faker_operators=True
(or just remove this parameter, because it's set to True
by default):
anonymizer = PresidioReversibleAnonymizer(
add_default_faker_operators=True,
# Faker seed is used here to make sure the same fake data is generated for the test purposes
# In production, it is recommended to remove the faker_seed parameter (it will default to None)
faker_seed=42,
)
anonymizer.add_recognizer(polish_id_recognizer)
anonymizer.add_recognizer(time_recognizer)
print_colored_pii(anonymizer.anonymize(document_content))
Date: 1986-04-18
Witness: Brian Cox DVM
Subject: Testimony Regarding the Loss of Wallet
Testimony Content:
Hello Officer,
My name is Brian Cox DVM and on 1986-04-18, my wallet was stolen in the vicinity of New Rita during a bike trip. This wallet contains some very important things to me.
Firstly, the wallet contains my credit card with number 6584801845146275, which is registered under my name and linked to my bank account, GB78GSWK37672423884969.
Additionally, the wallet had a driver's license - DL No: 781802744 issued to my name. It also houses my Social Security Number, 687-35-1170.
What's more, I had my polish identity card there, with the number [31m<POLISH_ID>[0m.
I would like this data to be secured and protected in all possible ways. I believe It was stolen at [31m<TIME>[0m.
In case any information arises regarding my wallet, please reach out to me on my phone number, 7344131647, or through my personal email, jamesmichael@example.com.
Please consider this information to be highly confidential and respect my privacy.
The bank has been informed about the stolen credit card and necessary actions have been taken from their end. They will be reachable at their official email, blakeerik@example.com.
My representative there is Cristian Santos (her business phone: 2812140441).
Thank you for your assistance,
Brian Cox DVM
As you can see, almost all values have been replaced with synthetic ones. The only exception is the Polish ID number and time, which are not supported by the default faker operators. We can add new operators to the anonymizer, which will generate random data.
5. Add Custom Operators (Optional)
While using placeholders or markers is a valid approach, it's often better to replace PII entities with synthetic data to improve the LLM's performance. We can add custom operators to the anonymizer to generate synthetic data for specific entity types:
from faker import Faker
from presidio_anonymizer.entities import OperatorConfig
fake = Faker()
def fake_polish_id(_=None):
""" Example output: 'VTC592627'"""
return fake.bothify(text="???######").upper()
def fake_time(_=None):
""" Example output: '03:14 PM'"""
return fake.time(pattern="%I:%M %p")
new_operators = {
"POLISH_ID": OperatorConfig("custom", {"lambda": fake_polish_id}),
"TIME": OperatorConfig("custom", {"lambda": fake_time}),
}
anonymizer.add_operators(new_operators)
Now, when we anonymize the text, the Polish ID and time entities will be replaced with synthetic data generated by our custom operators.
anonymizer.reset_deanonymizer_mapping()
print_colored_pii(anonymizer.anonymize(document_content))
Date: 1974-12-26
Witness: Jimmy Murillo
Subject: Testimony Regarding the Loss of Wallet
Testimony Content:
Hello Officer,
My name is Jimmy Murillo and on 1974-12-26, my wallet was stolen in the vicinity of South Dianeshire during a bike trip. This wallet contains some very important things to me.
Firstly, the wallet contains my credit card with number 213108121913614, which is registered under my name and linked to my bank account, GB17DBUR01326773602606.
Additionally, the wallet had a driver's license - DL No: 532311310 issued to my name. It also houses my Social Security Number, 690-84-1613.
What's more, I had my polish identity card there, with the number UFB745084.
I would like this data to be secured and protected in all possible ways. I believe It was stolen at 11:54 AM.
In case any information arises regarding my wallet, please reach out to me on my phone number, 876.931.1656, or through my personal email, briannasmith@example.net.
Please consider this information to be highly confidential and respect my privacy.
The bank has been informed about the stolen credit card and necessary actions have been taken from their end. They will be reachable at their official email, samuel87@example.org.
My representative there is Joshua Blair (her business phone: 3361388464).
Thank you for your assistance,
Jimmy Murillo
pprint(anonymizer.deanonymizer_mapping)
{'CREDIT_CARD': {'213108121913614': '4111 1111 1111 1111'},
'DATE_TIME': {'1974-12-26': 'October 19, 2021'},
'EMAIL_ADDRESS': {'briannasmith@example.net': 'johndoe@example.com',
'samuel87@example.org': 'support@bankname.com'},
'IBAN_CODE': {'GB17DBUR01326773602606': 'PL61109010140000071219812874'},
'LOCATION': {'South Dianeshire': 'Kilmarnock'},
'PERSON': {'Jimmy Murillo': 'John Doe', 'Joshua Blair': 'Victoria Cherry'},
'PHONE_NUMBER': {'876.931.1656': '999-888-7777'},
'POLISH_ID': {'UFB745084': 'ABC123456'},
'TIME': {'11:54 AM': '9:30 AM'},
'UK_NHS': {'3361388464': '987-654-3210'},
'US_DRIVER_LICENSE': {'532311310': '999000680'},
'US_SSN': {'690-84-1613': '602-76-4532'}}
Voilà! Now all values are replaced with synthetic ones. Note that the deanonymizer mapping has been updated accordingly.
5. Building a Question-Answering System
With the anonymizer set up, we can integrate it into a question-answering system using LangChain:
from langchain_core.documents import Document
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Load and anonymize the data
documents = [Document(page_content=document_content)]
for doc in documents:
doc.page_content = anonymizer.anonymize(doc.page_content)
# Split the documents into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = text_splitter.split_documents(documents)
# Index the chunks (using OpenAI embeddings, since the data is already anonymized)
embeddings = OpenAIEmbeddings()
docsearch = FAISS.from_documents(chunks, embeddings)
retriever = docsearch.as_retriever()
# Create the anonymizer chain
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda, RunnableParallel, RunnablePassthrough
from langchain_openai import ChatOpenAI
template = """Answer the question based only on the following context: {context}
Question: {anonymized_question}
"""
prompt = ChatPromptTemplate.from_template(template)
model = ChatOpenAI(temperature=0.3)
_inputs = RunnableParallel(
question=RunnablePassthrough(),
anonymized_question=RunnableLambda(anonymizer.anonymize),
)
anonymizer_chain = (
_inputs
| {
"context": itemgetter("anonymized_question") | retriever,
"anonymized_question": itemgetter("anonymized_question"),
}
| prompt
| model
| StrOutputParser()
)
anonymizer_chain.invoke("Where did the theft of the wallet occur, at what time, and who was it stolen from?")
# 'The theft of the wallet occurred in the vicinity of New Rita during a bike trip. It was stolen from Brian Cox DVM. The time of the theft was 02:22 AM.'
# Add deanonymization step to the chain
chain_with_deanonymization = anonymizer_chain | RunnableLambda(anonymizer.deanonymize)
print(chain_with_deanonymization.invoke("Where did the theft of the wallet occur, at what time, and who was it stolen from?"))
# The theft of the wallet occurred in the vicinity of Kilmarnock during a bike trip. It was stolen from John Doe. The time of the theft was 9:30 AM.
print(chain_with_deanonymization.invoke("What was the content of the wallet in detail?"))
# The content of the wallet included a credit card with the number 4111 1111 1111 1111, registered under the name of John Doe and linked to the bank account PL61109010140000071219812874. It also contained a driver's license with the number 999000680 issued to John Doe, as well as his Social Security Number 602-76-4532. Additionally, the wallet had a Polish identity card with the number ABC123456.
print(chain_with_deanonymization.invoke("Whose phone number is it: 999-888-7777?"))
# The phone number 999-888-7777 belongs to John Doe.
Alternative Approach: Local Embeddings
If you prefer to index the data in its original form or use custom embeddings, you can follow an alternative approach where the context is anonymized after indexing:
from operator import itemgetter
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_core.prompts import format_document
from langchain_core.prompts.prompt import PromptTemplate
model_name = "BAAI/bge-base-en-v1.5"
encode_kwargs = {"normalize_embeddings": True}
local_embeddings = HuggingFaceBgeEmbeddings(
model_name=model_name, encode_kwargs=encode_kwargs, query_instruction="Represent this sentence for searching relevant passages:",
)
documents = [Document(page_content=document_content)]
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = text_splitter.split_documents(documents)
docsearch = FAISS.from_documents(chunks, local_embeddings)
retriever = docsearch.as_retriever()
DEFAULT_DOCUMENT_PROMPT = PromptTemplate.from_template(template="{page_content}")
def _combine_documents(docs, document_prompt=DEFAULT_DOCUMENT_PROMPT, document_separator="\n\n"):
doc_strings = [format_document(doc, document_prompt) for doc in docs]
return document_separator.join(doc_strings)
# Anonymize the context after retrieval
chain_with_deanonymization = (
RunnableParallel({"question": RunnablePassthrough()})
| {
"context": itemgetter("question")
| retriever
| _combine_documents
| anonymizer.anonymize,
"anonymized_question": lambda x: anonymizer.anonymize(x["question"]),
}
| prompt
| model
| StrOutputParser()
| RunnableLambda(anonymizer.deanonymize)
)
This approach retrieves the original context from the vector database, anonymizes it on-the-fly, and then proceeds with the same anonymization and deanonymization steps as before.
Conclusion
In this blog post, we explored a solution for protecting private data when building question-answering systems with LLMs. By using LangChain and the Presidio library, we can create a secure and customizable anonymization pipeline that replaces sensitive information with placeholders or synthetic data before feeding it to the LLM.
We covered the essential concepts of PII, data anonymization, and the Presidio library. We then walked through the implementation details, including initializing the anonymizer, adding custom recognizers and operators, and integrating the anonymization process into a question-answering system using LangChain.
By following this approach, we can ensure that our private data remains protected while still benefiting from the capabilities of state-of-the-art LLMs and external APIs.
Source Code
You can find the complete source code for this blog post in the LangChain documentation: https://python.langchain.com/v0.1/docs/guides/productionization/safety/presidio_data_anonymization/qa_privacy_protection/