Tags: scala, apache-spark, apache-spark-sql, parquet

Converting CSV to Parquet in Spark gives an error if CSV column headers contain spaces


I have a CSV file which I am converting to Parquet files using the Databricks library in Scala. I am using the code below:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").config("spark.sql.warehouse.dir", "local").getOrCreate()
val csvdf = spark.read.format("csv").option("header", true).csv(csvfile)
csvdf.write.parquet(csvfile + "parquet")

Now the above code works fine if there are no spaces in my column headers. But if a CSV file has spaces in its column headers, it doesn't work and errors out stating the column headers are invalid. My CSV files are comma-delimited.

Also, I cannot remove the spaces from the CSV column names. The column names have to stay as they are, even if they contain spaces, because they are given by the end user.

Any idea on how to fix this?


Solution

  • per @CodeHunter's request

    Sadly, the Parquet file format does not allow spaces in column names;
    the error it spits out when you try is: contains invalid character(s) among " ,;{}()\n\t=".

    ORC also does not allow spaces in column names :(

    Most SQL engines don't support column names with spaces either, so you're probably best off converting your columns to your preference of foo_bar or fooBar or something along those lines (see the sketch below).
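If you do go the renaming route, here is a minimal sketch of that approach, assuming the same local[*] session as the question and a hypothetical input path: it folds withColumnRenamed over every column, swapping the characters Parquet rejects for underscores before writing.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// hypothetical input path; substitute your own CSV file
val csvfile = "/path/to/input.csv"

val csvdf = spark.read.option("header", true).csv(csvfile)

// rename every column, replacing the characters Parquet rejects (" ,;{}()\n\t=") with underscores
val sanitized = csvdf.columns.foldLeft(csvdf) { (df, name) =>
  df.withColumnRenamed(name, name.replaceAll("[ ,;{}()\\n\\t=]", "_"))
}

sanitized.write.parquet(csvfile + "parquet")

With this, a header like first name ends up in the Parquet schema as first_name, which downstream SQL engines will accept.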