Tags: python, azure-databricks, azure-data-lake-gen2

How to read an .xpt format file from an Azure ADLS blob container and convert it to CSV format


I downloaded the .xpt format file from a URL to the blob container using a Databricks Python notebook.

In the code below, 'example.xpt' is a local file. How can I read the .xpt format file from the blob container instead?

import xport.v56

with open('example.xpt', 'rb') as f:
    library = xport.v56.load(f)

Appreciate any inputs. Thanks!


Solution

  • Assuming you have already installed the xport library on your cluster and mounted your ADLS blob container, follow the steps given below (a mount sketch is included after the first code block in case the container is not yet mounted):

    • Use the same code, except that the path now points to the .xpt file in your blob container.
    import xport.v56
    
    with open('/dbfs/mnt/repro/ALQY_F.XPT', 'rb') as f:
        # '/dbfs/mnt/repro/' is the mount point, i.e. the ADLS blob container.
        library = xport.v56.load(f)
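
    • In case the container is not yet mounted, a minimal mount sketch is given below. The service principal values (client_id, client_secret, tenant_id), the container name data and the mount point /mnt/repro are assumptions that mirror the paths used elsewhere in this answer.
    # Sketch only: mount an ADLS Gen2 container to DBFS with a service principal.
    configs = {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint": f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }
    dbutils.fs.mount(
        source="abfss://data@<adls_name>.dfs.core.windows.net/",
        mount_point="/mnt/repro",
        extra_configs=configs,
    )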
    
    • The library object is of type xport.v56.Library. It behaves like a dictionary of datasets, and its values() method returns an iterable of the contained xport.v56.Member objects.

    • Use the following code to write the required data to CSV format in the specified destination:

    for data in library.values():
        print(type(data))   # <class 'xport.v56.Member'>
        print(dir(data))    # lists all the attributes available on this object
        data.to_csv("/dbfs/mnt/repro/op.csv")   # writes the data as CSV to your blob container
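
    • Note that if the library contains more than one dataset, the loop above overwrites op.csv on every iteration. A small variation, sketched below on the assumption that library behaves like a dictionary mapping dataset names to Member objects, writes one CSV file per dataset:
    for name, member in library.items():
        # member supports the DataFrame-style to_csv() shown above
        member.to_csv(f"/dbfs/mnt/repro/{name}.csv")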
    


    Without Mounting:

    • With your client_id, tenant_id and client_secret, set up the configuration for your ADLS storage account using the following commands.
    spark.conf.set("fs.azure.account.auth.type.<adls_name>.dfs.core.windows.net", "OAuth")
    spark.conf.set("fs.azure.account.oauth.provider.type.<adls_name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    spark.conf.set("fs.azure.account.oauth2.client.id.<adls_name>.dfs.core.windows.net", client_id)
    spark.conf.set("fs.azure.account.oauth2.client.secret.<adls_name>.dfs.core.windows.net", client_secret)
    spark.conf.set("fs.azure.account.oauth2.client.endpoint.<adls_name>.dfs.core.windows.net", "https://login.microsoftonline.com/<tenant_id>/oauth2/token")
    
    • Now you can access your data lake storage container through the abfss scheme. However, passing an abfss path such as abfss://data@<adls_name>.dfs.core.windows.net/example.xpt directly to Python's open() does not work.


    • Hence, use dbutils.fs.cp() to copy the file from abfss to DBFS.
    # copy the file from the data lake (abfss) to DBFS
    dbutils.fs.cp("abfss://data@<adls_name>.dfs.core.windows.net/<file_name>", "dbfs:/<required_path>")
    
    • Now you can follow the procedure given above to create CSV files from the .xpt data (save the CSV to DBFS and copy it back to ADLS if required); a minimal end-to-end sketch follows.
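
    • Putting the no-mount path together, a minimal end-to-end sketch is given below. The container name data and the file ALQY_F.XPT come from the earlier examples; the intermediate DBFS folder /tmp, the output/ folder in the container, and treating library as a dictionary of datasets are assumptions.
    import xport.v56

    # Copy the .xpt file from ADLS (abfss) to DBFS so that Python's open() can read it.
    dbutils.fs.cp("abfss://data@<adls_name>.dfs.core.windows.net/ALQY_F.XPT", "dbfs:/tmp/ALQY_F.XPT")

    # Read the SAS transport file from the DBFS copy.
    with open('/dbfs/tmp/ALQY_F.XPT', 'rb') as f:
        library = xport.v56.load(f)

    # Write each dataset to CSV on DBFS, then copy the results back to ADLS if required.
    for name, member in library.items():
        member.to_csv(f"/dbfs/tmp/{name}.csv")
        dbutils.fs.cp(f"dbfs:/tmp/{name}.csv",
                      f"abfss://data@<adls_name>.dfs.core.windows.net/output/{name}.csv")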