
How can I import data downloaded from Kaggle to DBFS using Databricks Community Edition?

I managed to download datasets from Kaggle using the Kaggle API, and the data was stored under the /databricks/driver directory.

%sh pip install kaggle
export KAGGLE_USERNAME=my_name
export KAGGLE_KEY=my_key
kaggle competitions download -c ncaaw-march-mania-2021
%sh unzip

The problem is: how can I use them in DBFS? Below is how I read the data, followed by the error I got when I tried to read the CSV files with PySpark:

spark.read.csv('/databricks/driver/WDataFiles_Stage1/Cities.csv')
AnalysisException: Path does not exist: dbfs:/databricks/driver/WDataFiles_Stage1/Cities.csv


  • Spark works with DBFS paths by default, so you have two choices:

    • use file:/databricks/driver/... to force reading from the local file system. This works on Community Edition because it is a single-node cluster; it won't work on a distributed cluster, where the executors cannot see the driver's local disk.

    • copy files to DBFS using the dbutils.fs.cp command (docs) and read from DBFS:

    df = spark.read.csv("/FileStore/Cities.csv")
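
The second option could be wrapped up roughly as follows. This is only a sketch: `local_uri` and `copy_to_dbfs` are hypothetical helper names, the `dbfs:/FileStore` destination is taken from the example above, and `header=True` assumes the CSV has a header row. `dbutils` and `spark` are the objects a Databricks notebook predefines, so the last lines are left commented out.

```python
def local_uri(path: str) -> str:
    """Prefix an absolute driver-local path with the file: scheme."""
    if not path.startswith("/"):
        raise ValueError("expected an absolute path")
    return "file:" + path


def copy_to_dbfs(dbutils, local_path: str, dbfs_dir: str = "dbfs:/FileStore") -> str:
    """Copy one driver-local file to DBFS and return the DBFS destination path."""
    filename = local_path.rstrip("/").split("/")[-1]
    dest = dbfs_dir.rstrip("/") + "/" + filename
    # dbutils.fs.cp takes a source and a destination path
    dbutils.fs.cp(local_uri(local_path), dest)
    return dest


# In a Databricks notebook, where `dbutils` and `spark` already exist:
# dest = copy_to_dbfs(dbutils, "/databricks/driver/WDataFiles_Stage1/Cities.csv")
# df = spark.read.csv(dest, header=True)  # header=True is an assumption
```

Passing `dbutils` in as an argument keeps the helper usable outside a notebook cell, for example when checking the path logic with a stand-in object.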