python, pandas, google-cloud-platform, databricks, gcp-databricks

Can't read directly from pandas on GCP Databricks


Usually, on Databricks on Azure or AWS, to read files stored on Azure Blob Storage or S3, I would mount the bucket or blob container and then do the following:

If using Spark:

df = spark.read.format('csv').load('/mnt/my_bucket/my_file.csv', header="true")

If using pandas directly, adding /dbfs to the path:

df = pd.read_csv('/dbfs/mnt/my_bucket/my_file.csv')

I am trying to do the exact same thing on the hosted version of Databricks on GCP. Although I successfully managed to mount my bucket and read it with Spark, I am not able to do it with pandas directly: adding the /dbfs prefix does not work and I get a No such file or directory: ... error.

Has anyone encountered a similar issue? Am I missing something?

Also, when I run

%sh 
ls /dbfs

it returns nothing, even though I can see my mounted buckets and files in the DBFS browser in the UI.

Thanks for the help


Solution

  • It's documented in the list of features that are not yet released:

    DBFS access to local file system (FUSE mount). For DBFS access, the Databricks dbutils commands, Hadoop Filesystem APIs such as the %fs command, and Spark read and write APIs are available. Contact your Databricks representative for any questions.

    So you'll need to copy the file to the local disk before reading it with pandas:

    import pandas as pd

    # copy from the mounted bucket to the driver's local disk, then read the local copy with pandas
    dbutils.fs.cp("/mnt/my_bucket/my_file.csv", "file:/tmp/my_file.csv")
    df = pd.read_csv('/tmp/my_file.csv')
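
    The same trick works in the other direction if you need to write results back with pandas: write to the driver's local disk first, then copy the file to the mount with dbutils.fs.cp. A minimal sketch, assuming the same mount point and a hypothetical my_output.csv:

    # write locally with pandas, then push the file back to the mounted bucket
    df.to_csv('/tmp/my_output.csv', index=False)
    dbutils.fs.cp("file:/tmp/my_output.csv", "/mnt/my_bucket/my_output.csv")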