Tags: databricks, kaggle, databricks-community-edition

How can I import data downloaded from Kaggle to DBFS using Databricks Community Edition?


I managed to download a dataset from Kaggle using the Kaggle API. The files were stored under the /databricks/driver directory.

%sh pip install kaggle
%sh
export KAGGLE_USERNAME=my_name
export KAGGLE_KEY=my_key
kaggle competitions download -c ncaaw-march-mania-2021
%sh unzip ncaaw-march-mania-2021.zip

The problem is: how can I use these files in DBFS? Below is how I tried to read one of the CSV files with PySpark, followed by the error I got:

spark.read.csv('/databricks/driver/WDataFiles_Stage1/Cities.csv')
AnalysisException: Path does not exist: dbfs:/databricks/driver/WDataFiles_Stage1/Cities.csv

Solution

  • spark.read... resolves paths that have no scheme against DBFS by default (which is why the error mentions dbfs:/databricks/driver/...), so you have two choices:

    • use file:/databricks/driver/... to force reading from the local file system. This works on Community Edition because it is a single-node cluster. It won't work on a distributed cluster, where the executors cannot see the driver's local disk.

    • copy files to DBFS using the dbutils.fs.cp command (docs) and read from DBFS:

    dbutils.fs.cp("file:/databricks/driver/WDataFiles_Stage1/Cities.csv",
                  "/FileStore/Cities.csv")
    df = spark.read.csv("/FileStore/Cities.csv")
    ....
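
Since the unzipped archive contains a whole directory of CSV files, the second option can be applied to the entire folder at once. The following is a minimal sketch, not a definitive implementation: the helper names local_uri and copy_dir_to_dbfs and the target directory /FileStore/WDataFiles_Stage1 are illustrative assumptions, not part of the original answer. It relies on dbutils.fs.cp accepting a recurse flag, and takes dbutils as a parameter so the helpers can also be exercised outside a notebook.

```python
def local_uri(path):
    """Prefix an absolute driver-local path with the file: scheme.

    Without a scheme, Spark and dbutils resolve the path against DBFS.
    """
    if not path.startswith("/"):
        raise ValueError("expected an absolute path, got %r" % (path,))
    return "file:" + path


def copy_dir_to_dbfs(dbutils, local_dir, dbfs_dir):
    """Recursively copy a driver-local directory into DBFS.

    `dbutils` is the object Databricks predefines in notebooks.
    Returns the DBFS destination path for convenience.
    """
    dbutils.fs.cp(local_uri(local_dir), dbfs_dir, recurse=True)
    return dbfs_dir


# Usage inside a Databricks notebook, where `spark` and `dbutils` already exist.
# The target directory is a hypothetical choice; header=True assumes the CSV
# files start with a header row.
#
# target = copy_dir_to_dbfs(dbutils,
#                           "/databricks/driver/WDataFiles_Stage1",
#                           "/FileStore/WDataFiles_Stage1")
# df = spark.read.csv(target + "/Cities.csv", header=True)
```

The same local_uri helper covers the first option as well: spark.read.csv(local_uri("/databricks/driver/WDataFiles_Stage1/Cities.csv")) reads straight from the driver's disk, subject to the single-node caveat above.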