I'm trying to import a CSV file that has no headers into DBFS for Azure Databricks, but whether I use the UI or do it in code, the output shows null values in all four columns.
Here's the code I ran:
from pyspark.sql.types import *
# File location and type
file_location = "/FileStore/tables/sales.csv"
file_type = "csv"
# Options
delimiter = ","
customSchema = StructType([
    StructField("id", StringType(), True),
    StructField("company", IntegerType(), True),
    StructField("date", TimestampType(), True),
    StructField("price", DoubleType(), True)])
# Dataframe from CSV
df = spark.read.format(file_type) \
    .schema(customSchema) \
    .option("sep", delimiter) \
    .load(file_location)
display(df)
And the output I get is a table with null in every field of all four columns.
What is happening here? If I don't define any schema it loads the data perfectly, but then I have no way either to name the columns or to specify the datatypes.
I uploaded a sample csv file to test your script, whose content is as below.
1,Company-A,2019-09-30,10.01
2,Company-B,2019-09-29,20.02
Then, I reproduced your issue, as the figure below shows, and I think it was caused by the incorrect type for the struct field company: its values, such as Company-A, are strings, so parsing them as IntegerType fails, and in Spark's default PERMISSIVE mode a CSV row with any unparseable field comes back with null in every column.
So I used StringType instead of IntegerType for the company field, and then it works fine, as the figure below shows.
Meanwhile, if the value of the date field is just a date, you can use DateType instead of TimestampType, with the result shown in the figure below.
By the way, there are two other solutions for your needs.
Use the spark.read.csv function with the schema parameter to read a headerless csv file, as the code and figure below show.
df = spark.read.csv(file_location, schema='id INT, company STRING, date DATE, price DOUBLE')
display(df)
Use the pandas package to read the csv file from its dbfs file path on Azure Databricks first, then create a Spark DataFrame from the pandas dataframe, as the code and figure below show.
import pandas as pd
df_pandas = pd.read_csv('/dbfs/FileStore/tables/sales.csv', header=None, names = ['id', 'company', 'date', 'price'])
df = spark.createDataFrame(df_pandas)
display(df)
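One caveat with the pandas route: read_csv leaves the date column as plain strings (object dtype), so the resulting Spark DataFrame gets string dates too. A sketch of the same call with parse_dates, which converts the column to real datetimes first (the inlined sample rows just stand in for the dbfs file path):

```python
import io
import pandas as pd

# Sample rows from the answer, inlined so the sketch is self-contained;
# on Databricks you would pass '/dbfs/FileStore/tables/sales.csv' instead.
csv_text = "1,Company-A,2019-09-30,10.01\n2,Company-B,2019-09-29,20.02\n"

df_pandas = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=["id", "company", "date", "price"],
    parse_dates=["date"])  # datetime64 column instead of strings

# df = spark.createDataFrame(df_pandas)  # on Databricks; dates arrive as timestamps
```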