Search code examples
sqljsonsecurityhivemasking

Hive: How do we mask a particular key value pair ( containing PII data) in a json object, residing in a column, in Hive?


There is PII data that exists inside a key-value pair of a json object stored inside my Hive table in a column.Is there any way that I can mask only the value of the required key-value pair while keeping others unmasked?

{ "name":"John", "age":30, "car":null, "mob_no": "99999999"} 

I require field "mob_no" which is PII data and its associated value to be masked. Other key-value pairs of the json should be visible


Solution

  • You can use regexp_replace to replace some values in json with some string (see json_with_number_replaced column in the example below).

    Better apply sha256 instead of simple replacement, because this function is cryptographically strong (not reversible one way function), deterministic and collision tolerant. These properties of sha256 hash function allow you to join on obfuscated values(if using the same obfuscation method) and do aggregations. You can still count different mobile_numbers after obfuscation. See mobile_number_obfuscated and json_with_number_obfuscated columns calculation in the example below:

    select json original_json,
           s.original_mobile_number,
           s.mobile_number_obfuscated,
           regexp_replace(s.json, '(mob_no":[\\s]?")([+]?[\\d]+)','$1xx') as json_with_number_replaced,
           regexp_replace(s.json, '(mob_no":[\\s]?")([+]?[\\d]+)',concat('$1',mobile_number_obfuscated)) as json_with_number_obfuscated
    from
    (
    select regexp_extract(s.json, 'mob_no":[\\s]?"([+]?[\\d]+)',1) as original_mobile_number,
           java_method('org.apache.commons.codec.digest.DigestUtils', 'sha256Hex', regexp_extract(s.json, 'mob_no":[\\s]?"([+]?[\\d]+)',1)) mobile_number_obfuscated,
           s.json
    from
    (--original data
    select '{ "name":"John", "age":30, "car":null, "mob_no": "+99999999"}' as json
    )s
    )s;
    
    OK
    original_json                                                   original_mobile_number  mobile_number_obfuscated                                                json_with_number_replaced                               json_with_number_obfuscated
    { "name":"John", "age":30, "car":null, "mob_no": "+99999999"}   +99999999               98ae38dddc1994179e21d104feb7b09e5627953d9fe9b9851239ac445b6de3cd        { "name":"John", "age":30, "car":null, "mob_no": "xx"}  { "name":"John", "age":30, "car":null, "mob_no": "98ae38dddc1994179e21d104feb7b09e5627953d9fe9b9851239ac445b6de3cd"}
    Time taken: 1.988 seconds, Fetched: 1 row(s)
    

    In this example I demonstrated how you can use regexp_replace to replace value for some key in the JSON, also how to use sha256 to obfuscate your PII data.