There is PII data that exists inside a key-value pair of a json object stored inside my Hive table in a column.Is there any way that I can mask only the value of the required key-value pair while keeping others unmasked?
{ "name":"John", "age":30, "car":null, "mob_no": "99999999"}
I require field "mob_no"
which is PII data and its associated value to be masked. Other key-value pairs of the json should be visible
You can use regexp_replace
to replace some values in json with some string (see json_with_number_replaced
column in the example below).
Better apply sha256
instead of simple replacement, because this function is cryptographically strong (not reversible one way function), deterministic and collision tolerant. These properties of sha256 hash function allow you to join on obfuscated values(if using the same obfuscation method) and do aggregations. You can still count different mobile_numbers after obfuscation. See mobile_number_obfuscated
and json_with_number_obfuscated
columns calculation in the example below:
select json original_json,
s.original_mobile_number,
s.mobile_number_obfuscated,
regexp_replace(s.json, '(mob_no":[\\s]?")([+]?[\\d]+)','$1xx') as json_with_number_replaced,
regexp_replace(s.json, '(mob_no":[\\s]?")([+]?[\\d]+)',concat('$1',mobile_number_obfuscated)) as json_with_number_obfuscated
from
(
select regexp_extract(s.json, 'mob_no":[\\s]?"([+]?[\\d]+)',1) as original_mobile_number,
java_method('org.apache.commons.codec.digest.DigestUtils', 'sha256Hex', regexp_extract(s.json, 'mob_no":[\\s]?"([+]?[\\d]+)',1)) mobile_number_obfuscated,
s.json
from
(--original data
select '{ "name":"John", "age":30, "car":null, "mob_no": "+99999999"}' as json
)s
)s;
OK
original_json original_mobile_number mobile_number_obfuscated json_with_number_replaced json_with_number_obfuscated
{ "name":"John", "age":30, "car":null, "mob_no": "+99999999"} +99999999 98ae38dddc1994179e21d104feb7b09e5627953d9fe9b9851239ac445b6de3cd { "name":"John", "age":30, "car":null, "mob_no": "xx"} { "name":"John", "age":30, "car":null, "mob_no": "98ae38dddc1994179e21d104feb7b09e5627953d9fe9b9851239ac445b6de3cd"}
Time taken: 1.988 seconds, Fetched: 1 row(s)
In this example I demonstrated how you can use regexp_replace to replace value for some key in the JSON, also how to use sha256 to obfuscate your PII data.