python, azure, databricks, azure-databricks

How to access file/tables in dbfs storage of one ADB workspace from a totally different ADB Workspace


I want to access some files stored in the DBFS location of the ADB-1 (Azure Databricks) workspace from the ADB-2 workspace. I just want to understand whether it's possible to query another workspace's DBFS storage. I have googled this question a lot, but so far no luck. Please help!!

For example - in the ADB-1 workspace I have a file stored at /dbfs/FileStore/Zone1/File1.csv. How can I read File1.csv from the ADB-2 workspace? Any suggestions?


Solution

  • There is a DBFS API (2.0) in Databricks with a GET (read) method, but it has a few limitations.

    • Look at the following demonstration. Let's say dbfs1 is the workspace where the required files are present. You can use the code below to read the file from a notebook in the 2nd Databricks workspace.
    import requests
    
    #path of the file on dbfs1 (the workspace where the file is present)
    my_json = {"path": "/FileStore/sample1.csv", "offset": 0, "length": 1000000}
    
    #personal access token generated in the workspace where the file is present
    auth = {"Authorization": "Bearer <access_token>"}
    
    #url of the databricks workspace where the file is present
    response = requests.get('https://<databricks_workspace_url>/api/2.0/dbfs/read', json=my_json, headers=auth).json()
    #print(response)
    
    • The returned response is a JSON object in which the file data is returned as a base64 encoded string.


    • You can use the following code to decode the encoded string returned above.
    import base64
    
    #the 'data' field of the response is a base64 encoded string; decode it back to text
    encoded_data = response['data']
    decoded_data = base64.b64decode(encoded_data).decode('ascii')
    
    #print(decoded_data)
    


    • Since decoded_data is a string, I wrote it to a text file called output.csv and then read that file back as a CSV using the following code.
    #write the decoded string to DBFS through the /dbfs FUSE mount
    with open('/dbfs/FileStore/tables/output.csv', 'w') as f:
        f.write(decoded_data)
    
    df = spark.read.format("csv").option("header", True).load("/FileStore/tables/output.csv")
    df.show()
    


    NOTE:

    • This process does not work for parquet files. For other file formats the procedure changes (the above is an example for a CSV file; the process is different for JSON or other formats).

    • There is also a 1 MB size limit per call specified in the Microsoft documentation for the DBFS read API; a sketch of looping over the offset parameter to read larger files follows this list.

    • So, instead of doing this, you can either download the file to your local computer and re-upload it to the 2nd Databricks workspace's DBFS, or store the file in a storage account that both workspaces can access easily (see the second sketch below).
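
    • If the file is somewhat larger than 1 MB, the limit can be worked around by calling the read endpoint repeatedly and advancing the offset by the bytes_read value it returns. The following is only a minimal sketch assuming a text file such as a CSV; the workspace URL, token and path are placeholders you must replace.
    import base64
    import requests
    
    WORKSPACE_URL = "https://<databricks_workspace_url>"   #workspace where the file is present
    TOKEN = "<access_token>"
    SOURCE_PATH = "/FileStore/sample1.csv"
    CHUNK = 1000000                                        #the read API returns at most 1 MB per call
    
    auth = {"Authorization": "Bearer " + TOKEN}
    
    offset = 0
    parts = []
    while True:
        resp = requests.get(
            WORKSPACE_URL + "/api/2.0/dbfs/read",
            json={"path": SOURCE_PATH, "offset": offset, "length": CHUNK},
            headers=auth,
        ).json()
        bytes_read = resp.get("bytes_read", 0)
        if bytes_read == 0:
            break                                          #nothing left to read
        parts.append(base64.b64decode(resp["data"]))
        offset += bytes_read                               #advance by what was actually returned
    
    decoded_data = b"".join(parts).decode("utf-8")         #full file contents as text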
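
    • Sharing through a storage account is usually the cleaner option, because both workspaces can then read the same path directly with Spark. Below is a minimal sketch assuming a hypothetical ADLS Gen2 account mystorageaccount, a container shared-data and an account key stored in a secret scope my-scope; substitute your own names and preferred authentication method.
    #run from a notebook in either workspace once the file has been uploaded to the storage account
    storage_account = "mystorageaccount"    #hypothetical account name
    container = "shared-data"               #hypothetical container name
    
    #authenticate with the account key kept in a Databricks secret scope (assumed scope/key names)
    spark.conf.set(
        "fs.azure.account.key." + storage_account + ".dfs.core.windows.net",
        dbutils.secrets.get(scope="my-scope", key="storage-account-key"),
    )
    
    df = (spark.read.format("csv")
          .option("header", True)
          .load("abfss://" + container + "@" + storage_account
                + ".dfs.core.windows.net/Zone1/File1.csv"))
    df.show()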