
PySpark Code for Data Masking Modification


Can someone help me modify the following PySpark code so that it masks only certain characters in a field? As it stands, the code masks every character in the field. For example, it turns the email address [email protected] into 7cdc15144825a91b55330425e3d109df77a31baf9b7e9a4597a047bf20470178.

However, I would like to mask only, say, the first five characters with *****, so that it appears as *****[email protected]

from pyspark.sql import SparkSession
import hashlib


class Mask:
    def __init__(self, salt: str):
        self.salt = salt

    def sha512(self, value):
        # Hash the value with the salt appended; the result is 128 hex characters.
        return hashlib.sha512(f'{value}{self.salt}'.encode()).hexdigest()

    def shake_128(self, value):
        # hexdigest(32) yields 32 bytes, i.e. 64 hex characters.
        return hashlib.shake_128(f'{value}{self.salt}'.encode()).hexdigest(32)

    def register(self, spark: SparkSession):
        # Expose both methods to Spark SQL under the names used in selectExpr.
        spark.udf.register('sha512', self.sha512)
        spark.udf.register('shake128', self.shake_128)


spark = SparkSession.builder.getOrCreate()
m = Mask('123456789')
m.register(spark)


To mask the email address, the code is used as follows:

spark.read \
    .format('csv') \
    .option('inferSchema', True) \
    .option('header', True) \
    .load(path) \
    .selectExpr(['user_name', 'shake128(email)']) \
    .write \
    .mode('append') \
    .saveAsTable('my_table')

Solution

  • You can use the regexp_replace function from pyspark.sql.functions for that:

    import pyspark.sql.functions as F
    from pyspark.sql.session import SparkSession
    
    spark = SparkSession.builder.getOrCreate()
    
    df = spark.createDataFrame(
        [
            ("[email protected]",),
            ("[email protected]",),
            ("[email protected]",),
            ("[email protected]",),
        ],
        ["email"],
    )
    
    df.select(
        F.regexp_replace("email", r"(^[^@]{1,5})", "*****").alias("masked_emails")
    ).show(truncate=False)
    +------------------------------+
    |masked_emails                 |
    +------------------------------+
    |*****[email protected]|
    |*****[email protected]  |
    |*****@hotmail.com             |
    |*****[email protected]        |
    +------------------------------+
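
To see what the pattern does without starting a Spark session, the same regular expression can be tried with Python's re module. This is only a sanity-check sketch: Spark evaluates regexp_replace with Java regular expressions, but this particular pattern should behave the same in both. The function name mask_first_five and the sample addresses are made up for illustration.

```python
import re

# Same pattern as in the regexp_replace call: anchored at the start of the
# string, it matches one to five leading characters that are not '@'.
PATTERN = re.compile(r"^[^@]{1,5}")


def mask_first_five(email):
    """Replace the matched prefix with exactly five asterisks."""
    return PATTERN.sub("*****", email)


if __name__ == "__main__":
    # Made-up sample addresses, not the ones from the question.
    for sample in ["john.smith@example.com", "abc@example.com"]:
        print(mask_first_five(sample))
    # *****smith@example.com
    # *****@example.com
```

The second sample shows the case where the part before the @ sign is shorter than five characters: the whole local part is replaced by five asterisks.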
    

    NOTE: if the part before the @ sign is fewer than 5 characters long, you'll still get 5 * signs. I imagine that is not a problem, since the goal is to mask the data.
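
If you would rather keep the original length for short local parts, and call the mask from selectExpr the same way as the UDFs in the question, one option is a small Python UDF. The following is a minimal sketch under those assumptions; mask_prefix and register_mask_prefix are hypothetical names introduced here, not part of PySpark, and only the call to spark.udf.register is taken from the question's code.

```python
def mask_prefix(value, n=5, char="*"):
    """Mask at most the first n characters of the part before '@'.

    Hypothetical helper: shorter local parts get fewer mask characters,
    so the overall length of the string is preserved.
    """
    if value is None:
        # Spark passes NULL column values to Python UDFs as None.
        return None
    local, sep, domain = value.partition("@")
    k = min(n, len(local))
    return char * k + local[k:] + sep + domain


def register_mask_prefix(spark):
    """Register the function for Spark SQL, mirroring Mask.register.

    `spark` is assumed to be an existing SparkSession; the UDF returns
    a string, which is the default return type of spark.udf.register.
    """
    spark.udf.register("mask_prefix", mask_prefix)
```

After calling register_mask_prefix(spark), the pipeline from the question could use .selectExpr(['user_name', 'mask_prefix(email) AS email']) instead of shake128(email). A Python UDF is generally slower than the built-in regexp_replace, so this is mainly worth it if preserving the length matters.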