python, excel, azure, pyspark, azure-databricks

How to mask hashed user values with random values like [email protected] in Azure Databricks


We have Excel files in the file store, and each Excel file contains more than 10,000 columns of JSON data.

For example, here is a sample:

json:

{"SearchName":"","Id":"","RequestType":"","StartDateUtc":"2022-12-01T00:00:00Z","EndDateUtc":"2023-04-28T00:00:00Z","RecordType":null,"Workload":"","Operations":[],"Users":["d503246e-285c-41bc-8b0a-bc79824146ea,[email protected],ab6019a4-851c-4af2-8ddc-1e03ee9be97a,[email protected],85ff7cda-5f2d-4d32-b51c-b88ad4d55b5a,[email protected],48168530-659c-44d3-8985-65f9b0af2b85,[email protected],0937a1e5-8a68-4573-ae9c-e13f9a2f3617,[email protected],c822dd8b-0b79-4c13-af1e-bc080b8108c5,[email protected],ca0de5ba-6ab2-4d34-b19d-ca702dcbdb8d,[email protected]"],"ObjectIds":[],"IPAddresses":[],"SiteIds":null,"AssociatedAdminUnits":[],"FreeText":"multifactor","ResultSize":0,"TimeoutInSeconds":345600,"ScopedAdminWithoutAdminUnits":false}

We just want to change the user hash values to normal masked values, like this: [email protected] into [email protected], for all users across the whole Excel file.

Every time, we manually copy the users' data and mask it like below, which takes a lot of time. Then we replace the hash values with whatever output we get.

import random

# Pool of masked email addresses to sample from:
main=['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]']  
  
l=["0e07209b-807b-4938-8bfd-f87cee98e924,[email protected],c747a82c-656e-40eb-9194-88c4a0f8061e"]  
n=len(l)  
print(n)  
print(random.sample(main,n))

My question: is there any way in Azure Databricks to replace all of the hash values under the Users JSON key across the whole Excel file with random users like [email protected] at once, and write the result back to a particular location?


Solution

  • As you mentioned, you want to change the user hash values to normal masked values.

    I have tried the below approach:

    sample_json = """
    {"SearchName":"","Id":"","RequestType":"","StartDateUtc":"2022-12-01T00:00:00Z","EndDateUtc":"2023-04-28T00:00:00Z","RecordType":null,"Workload":"","Operations":[],"Users":["d503246e-285c-41bc-8b0a-bc79824146ea,[email protected],ab6019a4-851c-4af2-8ddc-1e03ee9be97a,[email protected],85ff7cda-5f2d-4d32-b51c-b88ad4d55b5a,[email protected],48168530-659c-44d3-8985-65f9b0af2b85,[email protected],0937a1e5-8a68-4573-ae9c-e13f9a2f3617,[email protected],c822dd8b-0b79-4c13-af1e-bc080b8108c5,[email protected],ca0de5ba-6ab2-4d34-b19d-ca702dcbdb8d,[email protected]"],"ObjectIds":[],"IPAddresses":[],"SiteIds":null,"AssociatedAdminUnits":[],"FreeText":"multifactor","ResultSize":0,"TimeoutInSeconds":345600,"ScopedAdminWithoutAdminUnits":false}
    """
    masked_emails = [
        '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', 
        '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', 
        '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', 
        '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]'
    ]
    def mask_emails(json_str):
        try:
            data = json.loads(json_str)
            users = data.get("Users", [])
            if users:
                # The Users field holds one comma-separated string of GUID/email entries
                original_emails = users[0].split(',')
                # Replace every entry with a randomly sampled masked address
                # (random.sample needs the pool to be at least as long as the user list)
                masked = random.sample(masked_emails, len(original_emails))
                data["Users"] = [','.join(masked)]
            return json.dumps(data)
        except Exception:
            # Leave the value unchanged if it is not valid JSON
            return json_str

    # Register the masking function as a UDF and apply it to the JSON column
    mask_emails_udf = udf(mask_emails, StringType())
    data = [(sample_json,)]
    df = spark.createDataFrame(data, ["json_column"])
    df = df.withColumn("transformed", mask_emails_udf(col("json_column")))
    display(df)
    

    In the above code I am masking the email addresses in the JSON data: the UDF parses each JSON string, replaces the entries in the Users field with randomly sampled masked values, and serializes the transformed JSON back to a string.

    Results:

    transformed
    {"SearchName": "", "Id": "", "RequestType": "", "StartDateUtc": "2022-12-01T00:00:00Z", "EndDateUtc": "2023-04-28T00:00:00Z", "RecordType": null, "Workload": "", "Operations": [], "Users": ["[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected]"], "ObjectIds": [], "IPAddresses": [], "SiteIds": null, "AssociatedAdminUnits": [], "FreeText": "multifactor", "ResultSize": 0, "TimeoutInSeconds": 345600, "ScopedAdminWithoutAdminUnits": false}
    
    

    [Errno 2] No such file or directory: '/dbfs/FileStore/Book111.xlsx'

    The error indicates that the Python interpreter is unable to locate the specified Excel file at the given path in your DBFS.
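
    A quick way to confirm whether the file actually exists is to list the directory with dbutils.fs.ls; note that dbutils paths do not use the /dbfs FUSE prefix:

    # List the FileStore directory to check the exact file name and path
    display(dbutils.fs.ls("/FileStore/"))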

    To resolve the above error and write the results in xlsx format, I tried the below approach, which writes the workbook to the driver's local disk first and then copies it into DBFS:

    # Convert the Spark DataFrame to pandas so it can be written as xlsx
    pandas_df = df.toPandas()
    dir_path = "/FileStore/tables/"
    dbutils.fs.mkdirs(dir_path)
    file_path = dir_path + "transformed_data.xlsx"
    # Write to the driver's local disk first, since to_excel needs a local path
    local_file_path = "/tmp/transformed_data.xlsx"
    pandas_df.to_excel(local_file_path, index=False, engine='openpyxl')
    # Copy the local file into DBFS
    dbutils.fs.cp("file:" + local_file_path, "dbfs:" + file_path)
    print(f"File saved to {file_path}")
    

    Results:

    dbutils.fs.ls("/FileStore/tables/transformed_data.xlsx")
    
    [FileInfo(path='dbfs:/FileStore/tables/transformed_data.xlsx', name='transformed_data.xlsx', size=5623, modificationTime=1720070340000)]
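
    As a final check, you can read the written workbook back through the /dbfs FUSE mount (assuming your cluster exposes it) and confirm the masked values:

    import pandas as pd

    # Read the masked workbook back and inspect the first rows
    check_df = pd.read_excel("/dbfs/FileStore/tables/transformed_data.xlsx", engine='openpyxl')
    print(check_df.head())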