I just want to understand the difference between the Spark and pandas commands for reading Parquet files.
I have some data stored in DBFS, and I am using the commands below to successfully read that data into my Databricks instance:
d1 = spark.read.parquet('/user/hive/xyz/abc.db/test_file')       # works for the Spark read
d2 = pd.read_parquet('/dbfs/user/hive/xyz/abc.db/test_file')     # works for the pandas read
I want to understand why I have to add "/dbfs" at the start of the path to be able to read the file using pandas. If I use the same path, '/user/hive/xyz/abc.db/test_file', I get the error below:
Path does not exist: dbfs:/dbfs/user/hive//xyz/abc.db/test_file
I am able to read the file with both commands; I just want to understand why I need to change the path when reading Parquet with pandas.
/dbfs is the so-called DBFS FUSE mount that exposes DBFS to programs that rely on local file APIs (open, etc. in Python).
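As a rough sketch of what that means in practice (assuming a Databricks notebook with pandas installed, and reusing the path from your question), any code that goes through the local file API can see the same file once you prefix it with /dbfs:

import os
import pandas as pd

# Path taken from the question; adjust for your own workspace.
local_path = "/dbfs/user/hive/xyz/abc.db/test_file"

# Ordinary local-file APIs see DBFS through the FUSE mount.
print(os.path.exists(local_path))

# pandas reads through the same local-file API, hence the /dbfs prefix.
df = pd.read_parquet(local_path)
print(df.head())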
By default, when you use Spark commands, the dbfs: scheme is assumed, so your path /user/hive/xyz/abc.db/test_file is actually dbfs:/user/hive/xyz/abc.db/test_file.
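A quick way to convince yourself of that (a sketch, assuming a Databricks notebook where spark is already defined, and reusing the path from your question):

# Both calls resolve to the same DBFS location, because dbfs: is the
# default scheme for Spark paths on Databricks.
d1 = spark.read.parquet("/user/hive/xyz/abc.db/test_file")
d1_explicit = spark.read.parquet("dbfs:/user/hive/xyz/abc.db/test_file")

# Same underlying files, so the row counts should match.
print(d1.count() == d1_explicit.count())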
The Databricks documentation has a good table that describes the differences between the different file path schemes.