python, excel, azure, pyspark, azure-databricks

How to mask hashed user values with random values like [email protected] in Azure Databricks


We have Excel files in the file store, and each Excel file contains more than 10,000 columns of JSON data.

For example, here is a sample:

json:

{"SearchName":"","Id":"","RequestType":"","StartDateUtc":"2022-12-01T00:00:00Z","EndDateUtc":"2023-04-28T00:00:00Z","RecordType":null,"Workload":"","Operations":[],"Users":["d503246e-285c-41bc-8b0a-bc79824146ea,[email protected],ab6019a4-851c-4af2-8ddc-1e03ee9be97a,[email protected],85ff7cda-5f2d-4d32-b51c-b88ad4d55b5a,[email protected],48168530-659c-44d3-8985-65f9b0af2b85,[email protected],0937a1e5-8a68-4573-ae9c-e13f9a2f3617,[email protected],c822dd8b-0b79-4c13-af1e-bc080b8108c5,[email protected],ca0de5ba-6ab2-4d34-b19d-ca702dcbdb8d,[email protected]"],"ObjectIds":[],"IPAddresses":[],"SiteIds":null,"AssociatedAdminUnits":[],"FreeText":"multifactor","ResultSize":0,"TimeoutInSeconds":345600,"ScopedAdminWithoutAdminUnits":false}

We just want to change the user hash values to normal masked values, like this: [email protected] into [email protected], for all users across the whole Excel file.

Every time, we manually copy the users' data and mask it like below, which takes a lot of time. Then we replace the hash values with whatever output we get.

import random

# Pool of masked email addresses to sample from:
main=['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]']  
  
l=["0e07209b-807b-4938-8bfd-f87cee98e924,[email protected],c747a82c-656e-40eb-9194-88c4a0f8061e"]  
n=len(l)  
print(n)  
print(random.sample(main,n))

My question: is there any way in Azure Databricks to replace all of the hash values under the Users JSON key across the whole Excel file with random users like [email protected] at once, and write the result back to a particular location?


Solution

  • As you mentioned, you want to change the user hash values to normal masked values.

    I have tried the below approach:

    sample_json = """
    {"SearchName":"","Id":"","RequestType":"","StartDateUtc":"2022-12-01T00:00:00Z","EndDateUtc":"2023-04-28T00:00:00Z","RecordType":null,"Workload":"","Operations":[],"Users":["d503246e-285c-41bc-8b0a-bc79824146ea,[email protected],ab6019a4-851c-4af2-8ddc-1e03ee9be97a,[email protected],85ff7cda-5f2d-4d32-b51c-b88ad4d55b5a,[email protected],48168530-659c-44d3-8985-65f9b0af2b85,[email protected],0937a1e5-8a68-4573-ae9c-e13f9a2f3617,[email protected],c822dd8b-0b79-4c13-af1e-bc080b8108c5,[email protected],ca0de5ba-6ab2-4d34-b19d-ca702dcbdb8d,[email protected]"],"ObjectIds":[],"IPAddresses":[],"SiteIds":null,"AssociatedAdminUnits":[],"FreeText":"multifactor","ResultSize":0,"TimeoutInSeconds":345600,"ScopedAdminWithoutAdminUnits":false}
    """
    masked_emails = [
        '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', 
        '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', 
        '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', 
        '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]'
    ]
    def mask_emails(json_str):
        try:
            data = json.loads(json_str)
            users = data.get("Users", [])
            if users:
                # The Users field holds one comma-separated string of GUID/email entries
                original_emails = users[0].split(',')
                # Replace every entry with a randomly sampled masked address
                # (random.sample needs the pool to be at least as long as the user list)
                masked = random.sample(masked_emails, len(original_emails))
                data["Users"] = [','.join(masked)]
            return json.dumps(data)
        except Exception:
            # Leave the value unchanged if it is not valid JSON
            return json_str

    # Register the masking function as a UDF and apply it to the JSON column
    mask_emails_udf = udf(mask_emails, StringType())
    data = [(sample_json,)]
    df = spark.createDataFrame(data, ["json_column"])
    df = df.withColumn("transformed", mask_emails_udf(col("json_column")))
    display(df)
    

    In the above code I am masking the email addresses in the JSON data: the UDF parses each JSON string, replaces the entries in the Users field with randomly sampled masked values, and serializes the transformed JSON back to a string.

    Results:

    transformed
    {"SearchName": "", "Id": "", "RequestType": "", "StartDateUtc": "2022-12-01T00:00:00Z", "EndDateUtc": "2023-04-28T00:00:00Z", "RecordType": null, "Workload": "", "Operations": [], "Users": ["[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected]"], "ObjectIds": [], "IPAddresses": [], "SiteIds": null, "AssociatedAdminUnits": [], "FreeText": "multifactor", "ResultSize": 0, "TimeoutInSeconds": 345600, "ScopedAdminWithoutAdminUnits": false}
    
    

    [Errno 2] No such file or directory: '/dbfs/FileStore/Book111.xlsx'

    The error indicates that the Python interpreter is unable to locate the specified Excel file at the given path in your DBFS.
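
    A quick way to confirm whether the file actually exists is to list the directory with dbutils.fs.ls; note that dbutils paths do not use the /dbfs FUSE prefix:

    # List the FileStore directory to check the exact file name and path
    display(dbutils.fs.ls("/FileStore/"))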

    To resolve the above error and write the results in xlsx format, I tried the below approach, which writes the workbook to the driver's local disk first and then copies it into DBFS:

    # Convert the Spark DataFrame to pandas so it can be written as xlsx
    pandas_df = df.toPandas()
    dir_path = "/FileStore/tables/"
    dbutils.fs.mkdirs(dir_path)
    file_path = dir_path + "transformed_data.xlsx"
    # Write to the driver's local disk first, since to_excel needs a local path
    local_file_path = "/tmp/transformed_data.xlsx"
    pandas_df.to_excel(local_file_path, index=False, engine='openpyxl')
    # Copy the local file into DBFS
    dbutils.fs.cp("file:" + local_file_path, "dbfs:" + file_path)
    print(f"File saved to {file_path}")
    

    Results:

    dbutils.fs.ls("/FileStore/tables/transformed_data.xlsx")
    
    [FileInfo(path='dbfs:/FileStore/tables/transformed_data.xlsx', name='transformed_data.xlsx', size=5623, modificationTime=1720070340000)]
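
    As a final check, you can read the written workbook back through the /dbfs FUSE mount (assuming your cluster exposes it) and confirm the masked values:

    import pandas as pd

    # Read the masked workbook back and inspect the first rows
    check_df = pd.read_excel("/dbfs/FileStore/tables/transformed_data.xlsx", engine='openpyxl')
    print(check_df.head())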