CSV File download from Databricks Filestore in Python not working


I am using the Python code below to download a CSV file from the Databricks Filestore. Files kept in Filestore can normally be downloaded through the browser.

When I enter the URL of the file directly in my browser, the file downloads fine. But when I request the same URL with the code below, the downloaded file contains HTML instead of the CSV data (the HTML is shown further down).

Here is my Python code:

import requests

def download_from_dbfs_filestore(file):
    # Build the Filestore URL for the requested file.
    url = "https://databricks-hot-url/files/{0}".format(file)
    req = requests.get(url)
    # Write the raw response body to a local file with the same name.
    with open(file, 'wb') as my_file:
        my_file.write(req.content)

Here is the HTML I get back. It appears to be a login page, but I am not sure what to do from here:

<!doctype html><html><head><meta charset="utf-8"/>
<meta http-equiv="Content-Language" content="en"/>
<title>Databricks - Sign In</title><meta name="viewport" content="width=960"/>
<link rel="icon" type="image/png" href="/favicon.ico"/>
<meta http-equiv="content-type" content="text/html; charset=UTF8"/><link rel="icon" href="favicon.ico">
</head><body class="light-mode"><uses-legacy-bootstrap><div id="login-page">
</div></uses-legacy-bootstrap><script src="login/login.xxxxx.js"></script>
</body>
</html>

Solution

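  • Solved the problem by reading the file through the DBFS REST API with a bearer token instead of the browser URL. The API returns the file content base64-encoded, so it is decoded with b64decode from the base64 module: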

    import base64
    import requests

    DOMAIN = "<your databricks 'host' url>"
    TOKEN = "<your databricks 'token'>"
    jsonbody = {"path": "<your dbfs Filestore path>"}
    # Authenticate with the token; the response JSON carries the file content base64-encoded in "data".
    response = requests.get('https://%s/api/2.0/dbfs/read/' % (DOMAIN), headers={'Authorization': 'Bearer %s' % TOKEN}, json=jsonbody)
    if response.status_code == 200:
        csv = base64.b64decode(response.json()["data"]).decode('utf-8')
        print(csv)
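
    A single call like the one above may only return part of a larger file. As far as I understand the DBFS API, the read endpoint also accepts "offset" and "length" fields, caps each read at about 1 MB, and reports a "bytes_read" count in the response; treat those details as assumptions to check against the Databricks documentation. Under those assumptions, a minimal sketch that reads the whole file in chunks could look like this (the function name read_dbfs_file and the http_get parameter are hypothetical, introduced only for illustration):

```python
import base64

CHUNK_SIZE = 1024 * 1024  # assumed per-call read limit of 1 MB


def read_dbfs_file(domain, token, path, http_get=None, chunk_size=CHUNK_SIZE):
    """Return the full content of a DBFS file as bytes, reading it chunk by chunk."""
    if http_get is None:
        # Imported lazily so that a stub can be passed in instead (e.g. for testing).
        import requests
        http_get = requests.get

    url = "https://%s/api/2.0/dbfs/read" % domain
    headers = {"Authorization": "Bearer %s" % token}
    chunks = []
    offset = 0
    while True:
        # "offset" and "length" are assumed request fields (see the note above).
        response = http_get(
            url,
            headers=headers,
            json={"path": path, "offset": offset, "length": chunk_size},
        )
        response.raise_for_status()
        body = response.json()
        bytes_read = body.get("bytes_read", 0)
        if bytes_read == 0:
            break  # nothing left to read
        chunks.append(base64.b64decode(body["data"]))
        offset += bytes_read
        if bytes_read < chunk_size:
            break  # a short read means the end of the file was reached
    return b"".join(chunks)
```

    The returned bytes can then be written to disk with open(local_name, 'wb'), or decoded with .decode('utf-8') as in the snippet above.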