We have Excel files in a file store, each containing more than 10,000 columns of JSON data.
For example, a sample looks like this:
json:
{"SearchName":"","Id":"","RequestType":"","StartDateUtc":"2022-12-01T00:00:00Z","EndDateUtc":"2023-04-28T00:00:00Z","RecordType":null,"Workload":"","Operations":[],"Users":["d503246e-285c-41bc-8b0a-bc79824146ea,ingrid.van.driel@vroon.nl,ab6019a4-851c-4af2-8ddc-1e03ee9be97a,bart.van.someren@vroon.nl,85ff7cda-5f2d-4d32-b51c-b88ad4d55b5a,nicky.ongenae@vroon.nl,48168530-659c-44d3-8985-65f9b0af2b85,erwin.weeda@vroon.nl,0937a1e5-8a68-4573-ae9c-e13f9a2f3617,Thessa.vanden.Oetelaar@vroon.nl,c822dd8b-0b79-4c13-af1e-bc080b8108c5,Hester.Blankenstein@vroon.nl,ca0de5ba-6ab2-4d34-b19d-ca702dcbdb8d,Alvin.Baltonado@ph.vroonshipmanagement.com"],"ObjectIds":[],"IPAddresses":[],"SiteIds":null,"AssociatedAdminUnits":[],"FreeText":"multifactor","ResultSize":0,"TimeoutInSeconds":345600,"ScopedAdminWithoutAdminUnits":false}
We just want to change the user hash values to normal masked values,
like this: Alvin.Baltonado@ph.vroonshipmanagement.com
into sam@contoso.com
for the users across the whole Excel file.
Every time, we manually copy the users data and mask it like below, which takes a lot of time. Then we replace the hash values with whatever output we get.
import random
main=['nzn1@contoso.com', 'oman2@contoso.com', 'oman3@contoso.com', 'oman4@contoso.com', 'oman5@contoso.com', 'oman6@contoso.com', 'oman7@contoso.com', 'oman8@contoso.com', 'oman9@contoso.com', 'omaz1@contoso.com', 'omaz2@contoso.com', 'omaz3@contoso.com', 'omaz4@contoso.com', 'omaz5@contoso.com', 'omaom6@contoso.com', 'omax7@contoso.com', 'omaz8@contoso.com', 'omaz9@contoso.com', 'omay1@contoso.com', 'omay2@contoso.com', 'omaom3@contoso.com', 'omax4@contoso.com', 'omax5@contoso.com', 'omax6@contoso.com', 'omaom7@contoso.com', 'omaw8@contoso.com', 'omaw9@contoso.com', 'omae1@contoso.com', 'omae2@contoso.com', 'omae3@contoso.com', 'omae4@contoso.com', 'omae5@contoso.com', 'omae6@contoso.com', 'omae7@contoso.com', 'omae8@contoso.com', 'omae9@contoso.com', 'omaf1@contoso.com', 'omaf2@contoso.com', 'omaf3@contoso.com', 'omaf4@contoso.com', 'omaf5@contoso.com', 'omaf6@contoso.com', 'omaf7@contoso.com', 'omaf8@contoso.com', 'omaf9@contoso.com', 'omag1@contoso.com', 'omag2@contoso.com', 'omag3@contoso.com', 'omag4@contoso.com', 'omah1@contoso.com', 'omay5@contoso.com', 'omal1@contoso.com', 'omam6@contoso.com', 'omao1@contoso.com', 'omaom2@contoso.com', 'omao4@contoso.com', 'omag5@contoso.com', 'omah2@contoso.com', 'omay6@contoso.com', 'omal2@contoso.com', 'omam7@contoso.com', 'omao2@contoso.com', 'omao5@contoso.com', 'omao6@contoso.com', 'omag6@contoso.com', 'omah3@contoso.com', 'omay7@contoso.com', 'omal3@contoso.com', 'omam8@contoso.com', 'omao3@contoso.com', 'omao7@contoso.com', 'omao9@contoso.com', 'omag7@contoso.com', 'omah4@contoso.com', 'omaq1@contoso.com', 'omaq3@contoso.com', 'omai3@contoso.com', 'omah8@contoso.com', 'omax3@contoso.com', 'omal8@contoso.com', 'oman4@contoso.com', 'omaq2@contoso.com', 'omaq4@contoso.com', 'omaq5@contoso.com', 'omai4@contoso.com', 'omah9@contoso.com', 'omaom4@contoso.com', 'omal9@contoso.com', 'oman5@contoso.com', 'omaom3@contoso.com', 'omaq6@contoso.com', 'omaq8@contoso.com', 'omai5@contoso.com', 'omai1@contoso.com', 
'omaom5@contoso.com', 'omam1@contoso.com', 'oman6@contoso.com', 'omaq7@contoso.com', 'omaq9@contoso.com', 'omar2@contoso.com', 'omai6@contoso.com', 'omaw1@contoso.com', 'omaz6@contoso.com', 'omam2@contoso.com', 'oman7@contoso.com', 'omar1@contoso.com', 'omar3@contoso.com', 'omar5@contoso.com', 'omai7@contoso.com', 'omaw2@contoso.com', 'omaz7@contoso.com', 'omam3@contoso.com', 'oman8@contoso.com', 'omar4@contoso.com', 'omar6@contoso.com', 'omar8@contoso.com', 'omai8@contoso.com', 'omay3@contoso.com', 'omaom8@contoso.com', 'omam4@contoso.com', 'oman9@contoso.com', 'omar7@contoso.com', 'omar9@contoso.com', 'omaom1@contoso.com', 'omai9@contoso.com', 'omay4@contoso.com', 'omax9@contoso.com', 'omam5@contoso.com', 'oman7@contoso.com', 'oman8@contoso.com', 'oman9@contoso.com', 'omaz1@contoso.com', 'omaz2@contoso.com', 'omaz3@contoso.com', 'omaz4@contoso.com', 'omaz5@contoso.com', 'omaz6@contoso.com', 'omaz7@contoso.com', 'omaz8@contoso.com', 'omaz9@contoso.com', 'omay1@contoso.com', 'omay2@contoso.com', 'omay3@contoso.com', 'omay4@contoso.com', 'omay5@contoso.com', 'omay6@contoso.com', 'omay7@contoso.com', 'omay8@contoso.com', 'omay9@contoso.com', 'omae1@contoso.com', 'omae2@contoso.com', 'omae3@contoso.com', 'omae4@contoso.com', 'omae5@contoso.com', 'omae6@contoso.com', 'omae7@contoso.com', 'omaan1@contoso.com', 'omaan2@contoso.com', 'omaan3@contoso.com', 'omaan4@contoso.com', 'omaan5@contoso.com', 'omaan6@contoso.com']
# One comma-separated string of GUIDs and addresses to mask
l = ["0e07209b-807b-4938-8bfd-f87cee98e924,invoices@it.vroonoffshore.com,c747a82c-656e-40eb-9194-88c4a0f8061e"]
n = len(l)
print(n)
print(random.sample(main, n))
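As a side note, the hand-maintained list above can be generated programmatically, which removes most of the manual copying. A minimal sketch (the prefix/suffix pattern here is just an example, not tied to any particular file):

```python
import random

# Generate a pool of masked addresses instead of typing them by hand
pool = [f"oma{c}{d}@contoso.com" for c in "nzyxwefghilmoqr" for d in range(1, 10)]

# One comma-separated string of GUIDs and addresses, as in the Users field
entries = ("0e07209b-807b-4938-8bfd-f87cee98e924,"
           "invoices@it.vroonoffshore.com,"
           "c747a82c-656e-40eb-9194-88c4a0f8061e").split(",")

# Draw one distinct masked address per entry
masked = random.sample(pool, len(entries))
print(len(masked))  # 3
```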
My question: is there any way in Azure Databricks to replace all the user JSON key hash values in a whole Excel file with random masked users like ss@contoso.com in one go, and write the result back to a particular location?
As you mentioned, you want to change the user hash values to masked values.
I have tried the below approach:
sample_json = """
{"SearchName":"","Id":"","RequestType":"","StartDateUtc":"2022-12-01T00:00:00Z","EndDateUtc":"2023-04-28T00:00:00Z","RecordType":null,"Workload":"","Operations":[],"Users":["d503246e-285c-41bc-8b0a-bc79824146ea,ingrid.van.driel@vroon.nl,ab6019a4-851c-4af2-8ddc-1e03ee9be97a,bart.van.someren@vroon.nl,85ff7cda-5f2d-4d32-b51c-b88ad4d55b5a,nicky.ongenae@vroon.nl,48168530-659c-44d3-8985-65f9b0af2b85,erwin.weeda@vroon.nl,0937a1e5-8a68-4573-ae9c-e13f9a2f3617,Thessa.vanden.Oetelaar@vroon.nl,c822dd8b-0b79-4c13-af1e-bc080b8108c5,Hester.Blankenstein@vroon.nl,ca0de5ba-6ab2-4d34-b19d-ca702dcbdb8d,Alvin.Baltonado@ph.vroonshipmanagement.com"],"ObjectIds":[],"IPAddresses":[],"SiteIds":null,"AssociatedAdminUnits":[],"FreeText":"multifactor","ResultSize":0,"TimeoutInSeconds":345600,"ScopedAdminWithoutAdminUnits":false}
"""
masked_emails = [
'ss1@contoso.com', 'ss2@contoso.com', 'ss3@contoso.com', 'ss4@contoso.com', 'ss5@contoso.com',
'ss6@contoso.com', 'ss7@contoso.com', 'ss8@contoso.com', 'ss9@contoso.com', 'ss10@contoso.com',
'ss11@contoso.com', 'ss12@contoso.com', 'ss13@contoso.com', 'ss14@contoso.com', 'ss15@contoso.com',
'ss16@contoso.com', 'ss17@contoso.com', 'ss18@contoso.com', 'ss19@contoso.com', 'ss20@contoso.com'
]
import json
import random
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

def mask_emails(json_str):
    try:
        data = json.loads(json_str)
        users = data.get("Users", [])
        if users:
            # "Users" holds a single comma-separated string of GUIDs and emails
            original_emails = users[0].split(',')
            masked = random.sample(masked_emails, len(original_emails))
            data["Users"] = [','.join(masked)]
        return json.dumps(data)
    except Exception:
        # If the cell is not valid JSON, leave it unchanged
        return json_str

mask_emails_udf = udf(mask_emails, StringType())

data = [(sample_json,)]
df = spark.createDataFrame(data, ["json_column"])
df = df.withColumn("transformed", mask_emails_udf(col("json_column")))
display(df)
In the above code I am masking the email addresses in the JSON data: reading the JSON, replacing the user email addresses with masked values, and writing the transformed JSON back.
Results:
transformed
{"SearchName": "", "Id": "", "RequestType": "", "StartDateUtc": "2022-12-01T00:00:00Z", "EndDateUtc": "2023-04-28T00:00:00Z", "RecordType": null, "Workload": "", "Operations": [], "Users": ["ss18@contoso.com,ss17@contoso.com,ss9@contoso.com,ss1@contoso.com,ss2@contoso.com,ss20@contoso.com,ss16@contoso.com,ss15@contoso.com,ss6@contoso.com,ss7@contoso.com,ss11@contoso.com,ss4@contoso.com,ss12@contoso.com,ss8@contoso.com"], "ObjectIds": [], "IPAddresses": [], "SiteIds": null, "AssociatedAdminUnits": [], "FreeText": "multifactor", "ResultSize": 0, "TimeoutInSeconds": 345600, "ScopedAdminWithoutAdminUnits": false}
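The same masking logic can be sanity-checked outside Spark in plain Python. A minimal sketch with a shortened stand-in for the real audit-log JSON (the GUIDs and addresses below are placeholders):

```python
import json
import random

# Same masking approach as the UDF above, reduced to plain Python
masked_emails = [f"ss{i}@contoso.com" for i in range(1, 21)]

def mask_emails(json_str):
    try:
        data = json.loads(json_str)
        users = data.get("Users", [])
        if users:
            # "Users" holds a single comma-separated string of GUIDs and emails
            original = users[0].split(',')
            masked = random.sample(masked_emails, len(original))
            data["Users"] = [','.join(masked)]
        return json.dumps(data)
    except Exception:
        return json_str

# A shortened stand-in for the real audit-log JSON
sample = json.dumps({"Users": ["guid-1,a@vroon.nl,guid-2,b@vroon.nl"],
                     "FreeText": "multifactor"})
out = json.loads(mask_emails(sample))
print(out["Users"][0])  # four ssN@contoso.com values, comma-separated
print(out["FreeText"])  # multifactor
```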
[Errno 2] No such file or directory: '/dbfs/FileStore/Book111.xlsx'
The error means the Python interpreter is unable to locate the specified Excel file at the given path in DBFS.
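The underlying rule here is a path translation: files stored in DBFS (e.g. dbfs:/FileStore/...) are exposed to local Python processes under the /dbfs mount, so pandas and openpyxl can only open them through a /dbfs/... path. A small helper sketch (the file names here are just examples, not confirmed locations):

```python
def dbfs_to_local(path: str) -> str:
    """Translate a dbfs:/ or /FileStore-style path into the /dbfs local
    mount so that plain-Python libraries (pandas, openpyxl) can open it."""
    if path.startswith("dbfs:/"):
        return "/dbfs/" + path[len("dbfs:/"):]
    if path.startswith("/dbfs/"):
        return path
    return "/dbfs" + path  # e.g. "/FileStore/Book111.xlsx"

print(dbfs_to_local("dbfs:/FileStore/Book111.xlsx"))  # /dbfs/FileStore/Book111.xlsx
```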
To resolve the above error and write the results in xlsx format, I tried the below approach:
# Convert the Spark DataFrame to pandas so it can be written with openpyxl
pandas_df = df.toPandas()

dir_path = "/FileStore/tables/"
dbutils.fs.mkdirs(dir_path)
file_path = dir_path + "transformed_data.xlsx"

# Write locally first, then copy into DBFS, since to_excel cannot write to dbfs:/ directly
local_file_path = "/tmp/transformed_data.xlsx"
pandas_df.to_excel(local_file_path, index=False, engine='openpyxl')
dbutils.fs.cp("file:" + local_file_path, "dbfs:" + file_path)
print(f"File saved to {file_path}")
Results:
dbutils.fs.ls("/FileStore/tables/transformed_data.xlsx")
[FileInfo(path='dbfs:/FileStore/tables/transformed_data.xlsx', name='transformed_data.xlsx', size=5623, modificationTime=1720070340000)]