We have Excel files in FileStore, each containing more than 10,000 columns of JSON data (one JSON document per cell). We need to read the whole Excel file and transform the Users values in every JSON document. A sample of the data is shown below.
JSON 1:
{"SearchName":"","Id":"","RequestType":"","StartDateUtc":"2022-12-01T00:00:00Z","EndDateUtc":"2023-04-28T00:00:00Z","RecordType":null,"Workload":"","Operations":[],"Users":["d503246e-285c-41bc-8b0a-bc79824146ea,ingrid.van.driel@vroon.nl,ab6019a4c-1e03ee9be97a,bart.van.someren@vroon.nl,85ff-b51c-b88ad4d55b5a,nicky.ongenae@vroon.nl,48168530-6-8985-65f9b0af2b85,erwin.weeda@vroon.nl,0937a1e5-8a68-4573-ae9c-e13f9a2f3617,Thessa.vanden.Oetelaar@vroon.nl,c822dd8b-0b79-4c13-af1e-bc080b8108c5,Hester.Blankenstein@vroon.nl,ca0de5ba-6ab2-4d34-b19d-ca702dcbdb8d,Alvin.Baltado@ph.vroonshipmanagement.com"],"ObjectIds":[],"IPAddresses":[],"SiteIds":null,"AssociatedAdminUnits":[],"FreeText":"multifactor","ResultSize":0,"TimeoutInSeconds":345600,"ScopedAdminWithoutAdminUnits":false}
JSON 2:
{"SearchName":"xiong.jie.wu2","Id":"6797200c-4-a40c-6d8cfe7d6c16","RequestType":"AuditSearch","StartDateUtc":"2023-12-01T00:00:00Z","EndDateUtc":"2024-01-26T00:00:00Z","RecordType":null,"RecordTypes":[],"Workload":null,"Workloads":[],"WorkloadsToInclude":null,"WorkloadsToExclude":null,"ScopedWorkloadSearchEnabled":false,"Operations":["copy","harddelete","movetodeleteditems","move","softdelete","new-inboxrule","set-inboxrule","updateinboxrules","add-mailboxpermission","addfolderpermissions","modifyfolderpermissions"],"Users":["xiong@contoso.com"],"ObjectIds":[],"IPAddresses":[],"SiteIds":null,"AssociatedAdminUnits":[],"FreeText":"","ResultSize":0,"TimeoutInSeconds":345600,"ScopedAdminWithoutAdminUnits":false}
... and so on, with about 10,000 JSON documents like these in one Excel file.
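To make the structure explicit: each cell holds one JSON document like the ones above, and the Users key is a list that typically contains a single comma-separated string mixing GUIDs and email addresses. For example (a trimmed copy of JSON 1, just to show the shape):

import json

sample = '{"SearchName":"","Users":["d503246e-285c-41bc-8b0a-bc79824146ea,ingrid.van.driel@vroon.nl"],"FreeText":"multifactor"}'
doc = json.loads(sample)
print(doc["Users"])  # one list entry holding the comma-separated GUIDs and addresses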
We just want to replace the real user values with masked values, for example turning Alvin.Baltado@ph.vroonshipmanagement.com into sam@contoso.com, for every Users entry across the whole Excel file.
At the moment we copy the Users data out manually and mask it with a small script like the one below, which takes a lot of time; whatever output it produces, we then paste back over the original values.
import random

# Pool of masked addresses to draw replacements from
main=['nzn1@contoso.com', 'oman2@contoso.com', 'oman3@contoso.com', 'oman4@contoso.com', 'oman5@contoso.com', 'oman6@contoso.com', 'oman7@contoso.com', 'oman8@contoso.com', 'oman9@contoso.com', 'omaz1@contoso.com', 'omaz2@contoso.com', 'omaz3@contoso.com', 'omaz4@contoso.com', 'omaz5@contoso.com', 'omaom6@contoso.com', 'omax7@contoso.com', 'omaz8@contoso.com', 'omaz9@contoso.com', 'omay1@contoso.com', 'omay2@contoso.com', 'omaom3@contoso.com', 'omax4@contoso.com', 'omax5@contoso.com', 'omax6@contoso.com', 'omaom7@contoso.com', 'omaw8@contoso.com', 'omaw9@contoso.com', 'omae1@contoso.com', 'omae2@contoso.com', 'omae3@contoso.com', 'omae4@contoso.com', 'omae5@contoso.com']

# One Users value copied out of the Excel file (a single comma-separated string of GUIDs and addresses)
l=["0e07209b-807b-4938-8bfd-f87cee98e924,invoices@it.vroonoffshore.com,c747a82c-656e-40eb-9194-88c4a0f8061e"]
n=len(l)
print(n)
# Pick that many masked addresses at random from the pool
print(random.sample(main,n))
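What we would ultimately like to automate is this step: keep the GUIDs in a Users value as they are and swap only the email-looking tokens for random masked addresses. A rough, self-contained sketch of that idea (the regex, the pool list, and mask_users_value are illustrative names, not our current code):

import random
import re

pool = ['nzn1@contoso.com', 'oman2@contoso.com', 'oman3@contoso.com']  # shortened pool, stands in for main above
users_value = "0e07209b-807b-4938-8bfd-f87cee98e924,invoices@it.vroonoffshore.com,c747a82c-656e-40eb-9194-88c4a0f8061e"

EMAIL_RE = re.compile(r"[^@\s,]+@[^@\s,]+\.[^@\s,]+")  # crude email matcher, illustrative only

def mask_users_value(value, pool):
    # Keep GUID tokens unchanged, replace each email token with a random masked address
    return ",".join(
        random.choice(pool) if EMAIL_RE.fullmatch(token.strip()) else token
        for token in value.split(",")
    )

print(mask_users_value(users_value, pool))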
My question: is there any way in Azure Databricks to replace all the Users hash/email values in the JSON across the whole Excel file with random masked users like ss@contoso.com in one go, and write the result back to a particular location?
I have tried the approach below. First, install the required libraries:
%pip install openpyxl
%pip install xlrd
The code below masks the email addresses inside the JSON data stored in the Excel file:
import pandas as pd
import json

# Copy the Excel file from DBFS to the driver's local filesystem so pandas can read it
dbfs_input_path = 'dbfs:/FileStore/tables/sample_excel_file.xlsx'
local_input_path = '/tmp/sample_excel_file.xlsx'
dbutils.fs.cp(dbfs_input_path, 'file:' + local_input_path)

df = pd.read_excel(local_input_path)

def mask_emails(user_list):
    # Replace every entry in the Users list with the same masked address
    return ["sam@contoso.com" for _ in user_list]

# Walk every cell, parse it as JSON, mask the Users key, and write the JSON back into the cell
for col in df.columns:
    for idx, json_data in df[col].items():
        try:
            json_obj = json.loads(json_data)
            if 'Users' in json_obj:
                json_obj['Users'] = mask_emails(json_obj['Users'])
                df.at[idx, col] = json.dumps(json_obj)
        except (json.JSONDecodeError, TypeError):
            # Skip cells that are empty or not valid JSON
            continue

# Write the transformed data locally, then copy it back to DBFS
local_output_path = '/tmp/transformed_excel_file.xlsx'
df.to_excel(local_output_path, index=False)
dbfs_output_path = 'dbfs:/FileStore/tables/transformed_excel_file.xlsx'
dbutils.fs.cp('file:' + local_output_path, dbfs_output_path)
print(f"Transformed file saved to {dbfs_output_path}")
Output: Transformed file saved to dbfs:/FileStore/tables/transformed_excel_file.xlsx
In the code above, dbutils.fs.cp copies the file between DBFS and the driver's local filesystem: the file is stored locally only temporarily for processing and is then copied back to DBFS.
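As a side note, on clusters where the DBFS FUSE mount at /dbfs is available, the copy steps can probably be skipped by pointing pandas at the /dbfs/ paths directly. A minimal sketch, assuming the same file locations as above (whether this works depends on the cluster and runtime):

import pandas as pd

# Read the Excel file straight through the /dbfs FUSE mount (no dbutils.fs.cp needed)
df = pd.read_excel('/dbfs/FileStore/tables/sample_excel_file.xlsx')

# ... apply the same masking loop as above ...

# Write the transformed file back to DBFS directly
df.to_excel('/dbfs/FileStore/tables/transformed_excel_file.xlsx', index=False)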
Results:
# Copy the transformed file back to the driver's local filesystem and preview it
dbfs_output_path = 'dbfs:/FileStore/tables/transformed_excel_file.xlsx'
local_output_path = '/tmp/transformed_excel_file.xlsx'
dbutils.fs.cp(dbfs_output_path, 'file:' + local_output_path)
df_transformed = pd.read_excel(local_output_path)
display(df_transformed)
Column1:
{"SearchName": "", "Id": "", "RequestType": "", "StartDateUtc": "2022-12-01T00:00:00Z", "EndDateUtc": "2023-04-28T00:00:00Z", "RecordType": null, "Workload": "", "Operations": [], "Users": ["sam@contoso.com"], "ObjectIds": [], "IPAddresses": [], "SiteIds": null, "AssociatedAdminUnits": [], "FreeText": "multifactor", "ResultSize": 0, "TimeoutInSeconds": 345600, "ScopedAdminWithoutAdminUnits": false}

Column2:
{"SearchName": "xiong.jie.wu2", "Id": "6797200c-4-a40c-6d8cfe7d6c16", "RequestType": "AuditSearch", "StartDateUtc": "2023-12-01T00:00:00Z", "EndDateUtc": "2024-01-26T00:00:00Z", "RecordType": null, "RecordTypes": [], "Workload": null, "Workloads": [], "WorkloadsToInclude": null, "WorkloadsToExclude": null, "ScopedWorkloadSearchEnabled": false, "Operations": ["copy", "harddelete", "movetodeleteditems", "move", "softdelete", "new-inboxrule", "set-inboxrule", "updateinboxrules", "add-mailboxpermission", "addfolderpermissions", "modifyfolderpermissions"], "Users": ["sam@contoso.com"], "ObjectIds": [], "IPAddresses": [], "SiteIds": null, "AssociatedAdminUnits": [], "FreeText": "", "ResultSize": 0, "TimeoutInSeconds": 345600, "ScopedAdminWithoutAdminUnits": false}