We have Excel files in FileStore, each containing more than 10,000 columns of JSON data (one JSON document per cell). We need to read the whole Excel file and transform the Users values in every JSON document. A sample of the data is shown below.
JSON 1:
{"SearchName":"","Id":"","RequestType":"","StartDateUtc":"2022-12-01T00:00:00Z","EndDateUtc":"2023-04-28T00:00:00Z","RecordType":null,"Workload":"","Operations":[],"Users":["d503246e-285c-41bc-8b0a-bc79824146ea,ingrid.van.driel@vroon.nl,ab6019a4c-1e03ee9be97a,bart.van.someren@vroon.nl,85ff-b51c-b88ad4d55b5a,nicky.ongenae@vroon.nl,48168530-6-8985-65f9b0af2b85,erwin.weeda@vroon.nl,0937a1e5-8a68-4573-ae9c-e13f9a2f3617,Thessa.vanden.Oetelaar@vroon.nl,c822dd8b-0b79-4c13-af1e-bc080b8108c5,Hester.Blankenstein@vroon.nl,ca0de5ba-6ab2-4d34-b19d-ca702dcbdb8d,Alvin.Baltado@ph.vroonshipmanagement.com"],"ObjectIds":[],"IPAddresses":[],"SiteIds":null,"AssociatedAdminUnits":[],"FreeText":"multifactor","ResultSize":0,"TimeoutInSeconds":345600,"ScopedAdminWithoutAdminUnits":false}
JSON 2:
{"SearchName":"xiong.jie.wu2","Id":"6797200c-4-a40c-6d8cfe7d6c16","RequestType":"AuditSearch","StartDateUtc":"2023-12-01T00:00:00Z","EndDateUtc":"2024-01-26T00:00:00Z","RecordType":null,"RecordTypes":[],"Workload":null,"Workloads":[],"WorkloadsToInclude":null,"WorkloadsToExclude":null,"ScopedWorkloadSearchEnabled":false,"Operations":["copy","harddelete","movetodeleteditems","move","softdelete","new-inboxrule","set-inboxrule","updateinboxrules","add-mailboxpermission","addfolderpermissions","modifyfolderpermissions"],"Users":["xiong@contoso.com"],"ObjectIds":[],"IPAddresses":[],"SiteIds":null,"AssociatedAdminUnits":[],"FreeText":"","ResultSize":0,"TimeoutInSeconds":345600,"ScopedAdminWithoutAdminUnits":false}
... and so on, with about 10,000 JSON documents like these in one Excel file.
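To make the structure explicit: each cell holds one JSON document like the ones above, and the Users key is a list that typically contains a single comma-separated string mixing GUIDs and email addresses. For example (a trimmed copy of JSON 1, just to show the shape):

import json

sample = '{"SearchName":"","Users":["d503246e-285c-41bc-8b0a-bc79824146ea,ingrid.van.driel@vroon.nl"],"FreeText":"multifactor"}'
doc = json.loads(sample)
print(doc["Users"])  # one list entry holding the comma-separated GUIDs and addresses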
We just want to replace the real user values with masked values, for example turning Alvin.Baltado@ph.vroonshipmanagement.com into sam@contoso.com, for every Users entry across the whole Excel file.
At the moment we copy the Users data out manually and mask it with a small script like the one below, which takes a lot of time; whatever output it produces, we then paste back over the original values.
import random

# Pool of masked addresses to draw replacements from
main=['nzn1@contoso.com', 'oman2@contoso.com', 'oman3@contoso.com', 'oman4@contoso.com', 'oman5@contoso.com', 'oman6@contoso.com', 'oman7@contoso.com', 'oman8@contoso.com', 'oman9@contoso.com', 'omaz1@contoso.com', 'omaz2@contoso.com', 'omaz3@contoso.com', 'omaz4@contoso.com', 'omaz5@contoso.com', 'omaom6@contoso.com', 'omax7@contoso.com', 'omaz8@contoso.com', 'omaz9@contoso.com', 'omay1@contoso.com', 'omay2@contoso.com', 'omaom3@contoso.com', 'omax4@contoso.com', 'omax5@contoso.com', 'omax6@contoso.com', 'omaom7@contoso.com', 'omaw8@contoso.com', 'omaw9@contoso.com', 'omae1@contoso.com', 'omae2@contoso.com', 'omae3@contoso.com', 'omae4@contoso.com', 'omae5@contoso.com']

# One Users value copied out of the Excel file (a single comma-separated string of GUIDs and addresses)
l=["0e07209b-807b-4938-8bfd-f87cee98e924,invoices@it.vroonoffshore.com,c747a82c-656e-40eb-9194-88c4a0f8061e"]
n=len(l)
print(n)
# Pick that many masked addresses at random from the pool
print(random.sample(main,n))
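What we would ultimately like to automate is this step: keep the GUIDs in a Users value as they are and swap only the email-looking tokens for random masked addresses. A rough, self-contained sketch of that idea (the regex, the pool list, and mask_users_value are illustrative names, not our current code):

import random
import re

pool = ['nzn1@contoso.com', 'oman2@contoso.com', 'oman3@contoso.com']  # shortened pool, stands in for main above
users_value = "0e07209b-807b-4938-8bfd-f87cee98e924,invoices@it.vroonoffshore.com,c747a82c-656e-40eb-9194-88c4a0f8061e"

EMAIL_RE = re.compile(r"[^@\s,]+@[^@\s,]+\.[^@\s,]+")  # crude email matcher, illustrative only

def mask_users_value(value, pool):
    # Keep GUID tokens unchanged, replace each email token with a random masked address
    return ",".join(
        random.choice(pool) if EMAIL_RE.fullmatch(token.strip()) else token
        for token in value.split(",")
    )

print(mask_users_value(users_value, pool))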
My question: is there any way in Azure Databricks to replace all the Users hash/email values in the JSON across the whole Excel file with random masked users like ss@contoso.com in one go, and write the result back to a particular location?
I have tried the approach below. First, install the required libraries:
%pip install openpyxl
%pip install xlrd
The code below masks the email addresses inside the JSON data stored in the Excel file:
import pandas as pd
import json

# Copy the Excel file from DBFS to the driver's local filesystem so pandas can read it
dbfs_input_path = 'dbfs:/FileStore/tables/sample_excel_file.xlsx'
local_input_path = '/tmp/sample_excel_file.xlsx'
dbutils.fs.cp(dbfs_input_path, 'file:' + local_input_path)

df = pd.read_excel(local_input_path)

def mask_emails(user_list):
    # Replace every entry in the Users list with the same masked address
    return ["sam@contoso.com" for _ in user_list]

# Walk every cell, parse it as JSON, mask the Users key, and write the JSON back into the cell
for col in df.columns:
    for idx, json_data in df[col].items():
        try:
            json_obj = json.loads(json_data)
            if 'Users' in json_obj:
                json_obj['Users'] = mask_emails(json_obj['Users'])
                df.at[idx, col] = json.dumps(json_obj)
        except (json.JSONDecodeError, TypeError):
            # Skip cells that are empty or not valid JSON
            continue

# Write the transformed data locally, then copy it back to DBFS
local_output_path = '/tmp/transformed_excel_file.xlsx'
df.to_excel(local_output_path, index=False)
dbfs_output_path = 'dbfs:/FileStore/tables/transformed_excel_file.xlsx'
dbutils.fs.cp('file:' + local_output_path, dbfs_output_path)
print(f"Transformed file saved to {dbfs_output_path}")
Output: Transformed file saved to dbfs:/FileStore/tables/transformed_excel_file.xlsx
In the code above, dbutils.fs.cp copies the file between DBFS and the driver's local filesystem: the file is stored locally only temporarily for processing and is then copied back to DBFS.
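As a side note, on clusters where the DBFS FUSE mount at /dbfs is available, the copy steps can probably be skipped by pointing pandas at the /dbfs/ paths directly. A minimal sketch, assuming the same file locations as above (whether this works depends on the cluster and runtime):

import pandas as pd

# Read the Excel file straight through the /dbfs FUSE mount (no dbutils.fs.cp needed)
df = pd.read_excel('/dbfs/FileStore/tables/sample_excel_file.xlsx')

# ... apply the same masking loop as above ...

# Write the transformed file back to DBFS directly
df.to_excel('/dbfs/FileStore/tables/transformed_excel_file.xlsx', index=False)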
Results:
# Copy the transformed file back to the driver's local filesystem and preview it
dbfs_output_path = 'dbfs:/FileStore/tables/transformed_excel_file.xlsx'
local_output_path = '/tmp/transformed_excel_file.xlsx'
dbutils.fs.cp(dbfs_output_path, 'file:' + local_output_path)
df_transformed = pd.read_excel(local_output_path)
display(df_transformed)
Column1:
{"SearchName": "", "Id": "", "RequestType": "", "StartDateUtc": "2022-12-01T00:00:00Z", "EndDateUtc": "2023-04-28T00:00:00Z", "RecordType": null, "Workload": "", "Operations": [], "Users": ["sam@contoso.com"], "ObjectIds": [], "IPAddresses": [], "SiteIds": null, "AssociatedAdminUnits": [], "FreeText": "multifactor", "ResultSize": 0, "TimeoutInSeconds": 345600, "ScopedAdminWithoutAdminUnits": false}

Column2:
{"SearchName": "xiong.jie.wu2", "Id": "6797200c-4-a40c-6d8cfe7d6c16", "RequestType": "AuditSearch", "StartDateUtc": "2023-12-01T00:00:00Z", "EndDateUtc": "2024-01-26T00:00:00Z", "RecordType": null, "RecordTypes": [], "Workload": null, "Workloads": [], "WorkloadsToInclude": null, "WorkloadsToExclude": null, "ScopedWorkloadSearchEnabled": false, "Operations": ["copy", "harddelete", "movetodeleteditems", "move", "softdelete", "new-inboxrule", "set-inboxrule", "updateinboxrules", "add-mailboxpermission", "addfolderpermissions", "modifyfolderpermissions"], "Users": ["sam@contoso.com"], "ObjectIds": [], "IPAddresses": [], "SiteIds": null, "AssociatedAdminUnits": [], "FreeText": "", "ResultSize": 0, "TimeoutInSeconds": 345600, "ScopedAdminWithoutAdminUnits": false}