pyspark, databricks, azure-databricks, azure-file-share

Access Azure Files using Azure Databricks PySpark


I am trying to access a file with the .rds extension. I am using the code below, but it is not working.

import pandas as pd

url_sas_token = 'https://<my account name>.file.core.windows.net/test/test.rds?st=2020-01-27T10%3A16%3A12Z&se=2020-01-28T10%3A16%3A12Z&sp=rl&sv=2018-03-28&sr=f&sig=XXXXXXXXXXXXXXXXX'
# Try to read the file content directly from its URL with SAS token into a pandas dataframe
pdf = pd.read_excel(url_sas_token)
# Then, to convert the pandas dataframe to a PySpark dataframe in Azure Databricks
df = spark.createDataFrame(pdf)

Solution

  • I created a storage account, created a file share in it, and uploaded the .rds file to the file share.

    I generated a SAS token for the file in the storage account.
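
    The SAS URL can also be generated in code with the azure-storage-file package instead of the portal. A minimal sketch, assuming the share name test and file name test.rds from the question, with placeholder values for the account name and account key:

    from datetime import datetime, timedelta
    from azure.storage.file import FileService, FilePermissions
    
    # Placeholder credentials; substitute your own values
    file_service = FileService(account_name='<my account name>',
                               account_key='<my account key>')
    # Create a read-only SAS token valid for 24 hours
    sas_token = file_service.generate_file_shared_access_signature(
        share_name='test',
        directory_name=None,
        file_name='test.rds',
        permission=FilePermissions.READ,
        expiry=datetime.utcnow() + timedelta(hours=24))
    # Build the full SAS URL for the file
    url_sas_token = file_service.make_file_url('test', None, 'test.rds',
                                               sas_token=sas_token)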

    I installed the Azure Files client library in Databricks using

    pip install azure-storage-file 
    

    I installed the pyreadr package, which can read .rds files, using

    pip install pyreadr
    


    I then loaded the .rds file in Databricks using

    import pyreadr
    from urllib.request import urlopen
    
    url_sas_token = "<File Service SAS URL>"
    
    # Download the file through its SAS URL and save it locally
    response = urlopen(url_sas_token)
    content = response.read()
    with open('counties.rds', 'wb') as fhandle:
        fhandle.write(content)
    
    # Read the local .rds file into an OrderedDict of pandas DataFrames
    result = pyreadr.read_r("counties.rds")
    print(result)
    

    In the code above, I set url_sas_token to the File Service SAS URL of the uploaded file.
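
    From here, getting the PySpark DataFrame the question asks for is straightforward: pyreadr returns an OrderedDict of pandas DataFrames keyed by the R object name, and for a plain .rds file the key is None. A short sketch, assuming the spark session available in a Databricks notebook:

    # A plain .rds file is stored under the key None
    pdf = result[None]
    # Convert the pandas DataFrame to a PySpark DataFrame in Azure Databricks
    df = spark.createDataFrame(pdf)
    display(df)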


    The code above loaded the .rds file data successfully.
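
    Alternatively, since azure-storage-file is already installed, the download step can use FileService directly instead of urlopen. A sketch, again assuming the share name test and file name test.rds, with a placeholder SAS token:

    import pyreadr
    from azure.storage.file import FileService
    
    # Placeholder values; substitute your own account name and SAS token
    file_service = FileService(account_name='<my account name>',
                               sas_token='<sas token>')
    # Download the file from the share root to a local path
    file_service.get_file_to_path('test', None, 'test.rds', 'counties.rds')
    result = pyreadr.read_r('counties.rds')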

    In this way, I accessed an .rds file stored in an Azure file share from Databricks.