
NULL values when trying to import CSV in Azure Databricks DBFS


I'm trying to import a CSV file that has no headers into DBFS for Azure Databricks, but whether I use the UI or do it in code, the output shows null values for all four columns.

Here's the code I ran:

from pyspark.sql.types import *

# File location and type
file_location = "/FileStore/tables/sales.csv"
file_type = "csv"

# Options

delimiter = ","

customSchema = StructType([
    StructField("id", StringType(), True),
    StructField("company", IntegerType(), True),
    StructField("date", TimestampType(), True),
    StructField("price", DoubleType(), True)])

# Dataframe from CSV
df = spark.read.format(file_type) \
    .schema(customSchema) \
    .option("sep", delimiter) \
    .load(file_location)

display(df)

And the output I get is:

[Screenshot: Databricks output showing null values in all four columns]

What is happening here? If I don't define any schema, the data loads perfectly, but then I have no way to specify either the column names or the data types.


Solution

  • I uploaded a sample csv file to test your script; its content is as below.

    1,Company-A,2019-09-30,10.01
    2,Company-B,2019-09-29,20.02
    

    Then I reproduced your issue, as shown in the figure below, and I think it was caused by the incorrect type for the struct field company.

    [Screenshot: reproduced result with null values in every column]

    So I used StringType instead of IntegerType for the field company, and then it worked fine, as shown in the figure below.

    [Screenshot: DataFrame displayed correctly with company as StringType]

    Meanwhile, if the value of the date field is just a date, you can use DateType instead of TimestampType; the result is shown in the figure below.

    [Screenshot: date column parsed as DateType]
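
    Putting those two changes together, here is a minimal sketch of the corrected script (same file path and reader options as in your question; the only changes are StringType for company and DateType for date):

    from pyspark.sql.types import StructType, StructField, StringType, DateType, DoubleType

    # company holds text such as "Company-A", so it must be StringType;
    # DateType matches plain dates like 2019-09-30
    customSchema = StructType([
        StructField("id", StringType(), True),
        StructField("company", StringType(), True),
        StructField("date", DateType(), True),
        StructField("price", DoubleType(), True)])

    df = spark.read.format("csv") \
        .schema(customSchema) \
        .option("sep", ",") \
        .load("/FileStore/tables/sales.csv")

    display(df)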

    By the way, there are two other solutions that would meet your needs.

    1. Use the spark.read.csv function with the schema parameter to read a headerless csv file, as the code and figure below show.

      df = spark.read.csv(file_location, schema='id INT, company STRING, date DATE, price DOUBLE')
      display(df)
      

      [Screenshot: result of spark.read.csv with the DDL schema string]
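
      If you prefer the generic reader API from your question, the same DDL-style schema string can be passed straight to .schema(), which accepts a DDL string as well as a StructType (Spark 2.3+):

      df = spark.read.format("csv") \
          .schema("id INT, company STRING, date DATE, price DOUBLE") \
          .option("sep", ",") \
          .load("/FileStore/tables/sales.csv")
      display(df)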

    2. Use the pandas package to read the csv file from its dbfs file path on Azure Databricks first, then create a Spark DataFrame from the pandas dataframe, as the code and figure below show.

      import pandas as pd
      df_pandas = pd.read_csv('/dbfs/FileStore/tables/sales.csv', header=None, names=['id', 'company', 'date', 'price'])
      df = spark.createDataFrame(df_pandas)
      display(df)
      

      [Screenshot: DataFrame created from the pandas dataframe]
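
      Note that with this route pandas reads the date column as plain strings, so the Spark column ends up as a string too. If you want real timestamps, you can have pandas parse the column first; a small sketch, assuming the same sales.csv as above:

      import pandas as pd

      # parse_dates turns the date column into datetime64, so
      # spark.createDataFrame infers TimestampType instead of StringType
      df_pandas = pd.read_csv('/dbfs/FileStore/tables/sales.csv',
                              header=None,
                              names=['id', 'company', 'date', 'price'],
                              parse_dates=['date'])
      df = spark.createDataFrame(df_pandas)
      display(df)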