Tags: python, azure-databricks, azure-data-lake-gen2

How to read an .xpt format file from an Azure ADLS blob container and convert it to CSV format


I downloaded the .xpt format file from a URL to the blob container using a Databricks Python notebook.

In the code below, 'example.xpt' is a local file. How can I read the .xpt format file from the blob container instead?

import xport.v56

with open('example.xpt', 'rb') as f:
    library = xport.v56.load(f)

Appreciate any inputs. Thanks!


Solution

  • Assuming you have already installed the xport library on your cluster and mounted your ADLS blob container, follow the steps given below (a mount sketch is included after the first code block in case the container is not yet mounted):

    • Use the same code, except that the path now points to the .xpt file in your blob container.
    import xport.v56
    
    with open('/dbfs/mnt/repro/ALQY_F.XPT', 'rb') as f:
        # '/dbfs/mnt/repro/' is the mount point, i.e. the ADLS blob container.
        library = xport.v56.load(f)
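
    • In case the container is not yet mounted, a minimal mount sketch is given below. The service principal values (client_id, client_secret, tenant_id), the container name data and the mount point /mnt/repro are assumptions that mirror the paths used elsewhere in this answer.
    # Sketch only: mount an ADLS Gen2 container to DBFS with a service principal.
    configs = {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint": f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }
    dbutils.fs.mount(
        source="abfss://data@<adls_name>.dfs.core.windows.net/",
        mount_point="/mnt/repro",
        extra_configs=configs,
    )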
    
    • The library object is of type xport.v56.Library. It behaves like a dictionary of datasets, and its values() method returns an iterable of the contained xport.v56.Member objects.

    • Use the following code to write the required data to CSV format in the specified destination:

    for data in library.values():
        print(type(data))   # <class 'xport.v56.Member'>
        print(dir(data))    # lists all the attributes available on this object
        data.to_csv("/dbfs/mnt/repro/op.csv")   # writes the data as CSV to your blob container
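
    • Note that if the library contains more than one dataset, the loop above overwrites op.csv on every iteration. A small variation, sketched below on the assumption that library behaves like a dictionary mapping dataset names to Member objects, writes one CSV file per dataset:
    for name, member in library.items():
        # member supports the DataFrame-style to_csv() shown above
        member.to_csv(f"/dbfs/mnt/repro/{name}.csv")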
    


    Without Mounting:

    • With your client_id, tenant_id and client_secret, set up the configuration for your ADLS storage account using the following commands.
    spark.conf.set("fs.azure.account.auth.type.<adls_name>.dfs.core.windows.net", "OAuth")
    spark.conf.set("fs.azure.account.oauth.provider.type.<adls_name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    spark.conf.set("fs.azure.account.oauth2.client.id.<adls_name>.dfs.core.windows.net", client_id)
    spark.conf.set("fs.azure.account.oauth2.client.secret.<adls_name>.dfs.core.windows.net", client_secret)
    spark.conf.set("fs.azure.account.oauth2.client.endpoint.<adls_name>.dfs.core.windows.net", "https://login.microsoftonline.com/<tenant_id>/oauth2/token")
    
    • Now you can access your data lake storage container through the abfss scheme. However, passing an abfss path such as abfss://data@<adls_name>.dfs.core.windows.net/example.xpt directly to Python's open() does not work.


    • Hence, use dbutils.fs.cp() to copy the file from abfss to DBFS.
    # copy the file from the data lake (abfss) to DBFS
    dbutils.fs.cp("abfss://data@<adls_name>.dfs.core.windows.net/<file_name>", "dbfs:/<required_path>")
    
    • Now you can follow the procedure given above to create CSV files from the .xpt data (save the CSV to DBFS and copy it back to ADLS if required); a minimal end-to-end sketch follows.
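
    • Putting the no-mount path together, a minimal end-to-end sketch is given below. The container name data and the file ALQY_F.XPT come from the earlier examples; the intermediate DBFS folder /tmp, the output/ folder in the container, and treating library as a dictionary of datasets are assumptions.
    import xport.v56

    # Copy the .xpt file from ADLS (abfss) to DBFS so that Python's open() can read it.
    dbutils.fs.cp("abfss://data@<adls_name>.dfs.core.windows.net/ALQY_F.XPT", "dbfs:/tmp/ALQY_F.XPT")

    # Read the SAS transport file from the DBFS copy.
    with open('/dbfs/tmp/ALQY_F.XPT', 'rb') as f:
        library = xport.v56.load(f)

    # Write each dataset to CSV on DBFS, then copy the results back to ADLS if required.
    for name, member in library.items():
        member.to_csv(f"/dbfs/tmp/{name}.csv")
        dbutils.fs.cp(f"dbfs:/tmp/{name}.csv",
                      f"abfss://data@<adls_name>.dfs.core.windows.net/output/{name}.csv")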