Can someone help me modify the following PySpark code so that it masks only certain characters in a field? As it stands, the code masks every character in the field. For example, it turns the email address [email protected] into 7cdc15144825a91b55330425e3d109df77a31baf9b7e9a4597a047bf20470178.
However, I would like to mask only the first five characters, say, with ***** so that it appears as ******[email protected]
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
import hashlib
class Mask:
    def __init__(self, salt: str):
        self.salt = salt

    def sha512(self, value):
        return hashlib.sha512(f'{value}{self.salt}'.encode()).hexdigest()

    def shake_128(self, value):
        return hashlib.shake_128(f'{value}{self.salt}'.encode()).hexdigest(32)

    def register(self, spark: SparkSession):
        spark.udf.register('sha512', self.sha512)
        spark.udf.register('shake128', self.shake_128)

spark = SparkSession.builder.getOrCreate()
m = Mask('123456789')
m.register(spark)
The code is used to mask an email address as follows:
spark.read \
    .format('csv') \
    .option('inferSchema', True) \
    .option('header', True) \
    .load(path) \
    .selectExpr(['user_name', 'shake128(email)']) \
    .write \
    .mode('append') \
    .saveAsTable('my_table')
You can use the regexp_replace function from pyspark.sql.functions for that:
import pyspark.sql.functions as F
from pyspark.sql.session import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [
        ("[email protected]",),
        ("[email protected]",),
        ("[email protected]",),
        ("[email protected]",),
    ],
    ["email"],
)
df.select(
    F.regexp_replace("email", r"(^[^@]{1,5})", "*****").alias("masked_emails")
).show(truncate=False)
+------------------------------+
|masked_emails |
+------------------------------+
|*****[email protected]|
|*****[email protected] |
|*****@hotmail.com |
|*****[email protected] |
+------------------------------+
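NOTE: if the part before the @ sign is fewer than 5 characters long, you'll still get 5 * signs, because the pattern matches between 1 and 5 characters and always replaces the match with the fixed string *****. I imagine that is not a problem, since you're trying to mask data.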
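To see the behaviour described in the note without starting a Spark session, the same pattern can be tried with Python's built-in re module. This is only a rough sketch: the function names mask_fixed and mask_same_length and the example.com addresses are made up for illustration, and the second function is a hypothetical variant (not part of the answer above) for the case where you would rather keep the original length.

```python
import re

# Same pattern as the regexp_replace call above: up to five leading
# characters that are not "@".
PREFIX = re.compile(r"^[^@]{1,5}")


def mask_fixed(email: str) -> str:
    """Replace the matched prefix with exactly five asterisks."""
    return PREFIX.sub("*****", email)


def mask_same_length(email: str) -> str:
    """Hypothetical variant: one asterisk per matched character."""
    return PREFIX.sub(lambda m: "*" * len(m.group(0)), email)


# Made-up addresses, used only for illustration.
for address in ("alexander@example.com", "bob@example.com"):
    print(mask_fixed(address), mask_same_length(address))
```

For the short address, mask_fixed gives *****@example.com while mask_same_length gives ***@example.com. If the length-preserving form is what you need in Spark, one option might be to register such a function with spark.udf.register, the same way the question registers shake128, at the cost of running a Python UDF instead of a built-in function.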