java python nlp privacy word-embedding

Identifying personal information from column description


I have a question about the identification of GDPR (General Data Protection Regulation) related sentences. Is there a tool or method in Python, Java, etc. that identifies whether a database column contains personally identifiable information from its description alone?

We could use word embeddings to get the "most_similar" or "most_similar_cosmul" words for a sentence and then look for GDPR-related keywords (biometric, personal, id, photo...), but the results depend on the robustness of the word embedding model.
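
As a rough illustration of the idea, here is a minimal sketch that scores a column description by its highest cosine similarity to a keyword list. Everything in it is an assumption made for illustration: the three-dimensional vectors in `TOY_VECTORS` are invented and stand in for a trained embedding model, and `GDPR_KEYWORDS` and `max_keyword_similarity` are hypothetical names, not part of any library.

```python
import math

# Toy 3-dimensional vectors standing in for a trained word-embedding model.
# The numbers are made up for illustration; a real model would supply them.
TOY_VECTORS = {
    "biometric": [0.9, 0.1, 0.0],
    "fingerprint": [0.8, 0.2, 0.1],
    "photo": [0.7, 0.3, 0.0],
    "price": [0.0, 0.1, 0.9],
    "quantity": [0.1, 0.0, 0.8],
}

# Hypothetical keyword list; which words belong here is a judgment call.
GDPR_KEYWORDS = ["biometric", "photo"]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def max_keyword_similarity(description, vectors=TOY_VECTORS, keywords=GDPR_KEYWORDS):
    """Highest similarity between any known word in the description and any keyword.

    Words missing from the vocabulary are skipped, so a description made only
    of unknown words scores 0.0.
    """
    best = 0.0
    for word in description.lower().split():
        if word not in vectors:
            continue
        for keyword in keywords:
            best = max(best, cosine(vectors[word], vectors[keyword]))
    return best


if __name__ == "__main__":
    for text in ("fingerprint scan", "unit price"):
        print(text, round(max_keyword_similarity(text), 3))
```

With these toy vectors, "fingerprint scan" scores much higher than "unit price", but the ranking is only as good as the vectors and the keyword list behind it.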

Thank you in advance,


Solution

  • There is no such thing as "personally identifiable information" in GDPR. The term (from GDPR article 4(1)) is "personal data", defined as:

    any information relating to an identified or identifiable natural person

    and it doesn't itself have to be identifying to qualify. What's an "identifiable natural person"? GDPR says:

    an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person

    The key thing that turns regular "data" into "personal data" here is that "one or more factors" phrase. A single field, such as a phone number, could reasonably be considered as uniquely identifying a person. By itself a postal code probably doesn't, but when combined with a street address and a first name, we'd be very close to being able to identify someone, and hence all other data would become "personal". It's hard to evaluate whether a collection of fields is enough to uniquely identify someone or not – you might think that first name and city might not identify an individual, given "John" and "London", but "Esmerelda" and "Ulaanbaatar" might be pretty easy to track down, and it's the "worst case" that counts.

    For a simpler example: A colour value such as #663399 by itself is just plain "data", is not "personal data", and is not subject to GDPR. That exact same value stored as "favourite colour" in a field in a table linking that data to a person is personal data. "City" in a table of cities is not personal data, but a "city" field in a user table is.

    In short, you're not going to be able to do what you want. You can't tell whether a field is personal data or not from its name or description alone, because you have insufficient context.
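
To make the "one or more factors" point concrete, here is a minimal sketch that counts how many records share each combination of field values. The sample rows in `USERS` and the helper names `combination_counts` and `unique_combinations` are hypothetical. Uniqueness within one dataset is only a rough proxy for identifiability; it does not decide whether something is personal data under GDPR.

```python
from collections import Counter


def combination_counts(records, fields):
    """Count how many records share each combination of values for `fields`."""
    return Counter(tuple(record[f] for f in fields) for record in records)


def unique_combinations(records, fields):
    """Return the value combinations that occur exactly once in `records`.

    These are the "worst case" rows: the chosen fields alone single out
    one record in this dataset.
    """
    counts = combination_counts(records, fields)
    return [combo for combo, n in counts.items() if n == 1]


# Hypothetical sample rows, echoing the example in the answer.
USERS = [
    {"first_name": "John", "city": "London"},
    {"first_name": "John", "city": "London"},
    {"first_name": "John", "city": "London"},
    {"first_name": "Esmerelda", "city": "Ulaanbaatar"},
]

if __name__ == "__main__":
    print(combination_counts(USERS, ["first_name", "city"]))
    print(unique_combinations(USERS, ["first_name", "city"]))
```

Here the same two fields are shared by three records in one case and single out one record in the other, which is not something a column name or description can reveal.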