Tags: pandas, pyspark, azure-databricks, parquet

spark read parquet vs pandas read parquet


I just wanted to understand the difference between the Spark and pandas commands for reading Parquet files.

I have some data stored in DBFS, and I am using the commands below to successfully read that data in my Databricks instance:

d1 = spark.read.parquet('/user/hive/xyz/abc.db/test_file')      # this works for the Spark read
d2 = pd.read_parquet('/dbfs/user/hive/xyz/abc.db/test_file')    # this works for the pandas read

I want to understand why I have to add "/dbfs" at the start to be able to read the file with pandas. If I use the same path, '/user/hive/xyz/abc.db/test_file', I get the error below:

Path does not exist: dbfs:/dbfs/user/hive//xyz/abc.db/test_file

I am able to read the file using both commands; I just want to understand why I need to change the path when I am reading Parquet with pandas.


Solution

  • /dbfs is the so-called DBFS FUSE mount that exposes DBFS to programs that rely on local file APIs (Python's open, etc.).

    When you use Spark commands, the dbfs: scheme is the default, so your path /user/hive/xyz/abc.db/test_file is actually dbfs:/user/hive/xyz/abc.db/test_file.

    The Databricks documentation has a good table that describes the differences between the different file path schemes.
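
    As a minimal sketch (assuming a Databricks notebook where spark is already defined, and using the paths from the question), the same data can be reached both ways, and the FUSE mount also works with any plain local file API:

    import os
    import pandas as pd

    path = '/user/hive/xyz/abc.db/test_file'

    # Spark resolves bare paths against the dbfs: scheme by default,
    # so this reads dbfs:/user/hive/xyz/abc.db/test_file.
    d1 = spark.read.parquet(path)

    # pandas relies on local file APIs, which only see DBFS through the
    # FUSE mount, so the same data must be addressed under /dbfs.
    d2 = pd.read_parquet('/dbfs' + path)

    # The mount behaves like an ordinary local path for local file APIs:
    print(os.path.exists('/dbfs' + path))   # True when the data exists
    print(os.listdir('/dbfs/user/hive'))    # plain Python can list DBFS directories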