First off, can I just say that I am learning DataBricks at the time of writing this post, so I'd like simpler, cruder solutions as well as more sophisticated ones.
I am reading a CSV file like this:
df1 = spark.read.format("csv").option("header", True).load(path_to_csv_file)
Then I'm saving it as a Delta Live Table like this:
df1.write.format("delta").save("table_path")
The CSV headers have characters in them like space and &
and /
, and I get the error:
AnalysisException: Found invalid character(s) among " ,;{}()\n\t=" in the column names of your schema. Please enable column mapping by setting table property 'delta.columnMapping.mode' to 'name'. For more details, refer to https://docs.databricks.com/delta/delta-column-mapping.html Or you can use alias to rename it.
The documentation I've seen on the issue explains how to set the column mapping mode to 'name' AFTER a table has been created using ALTER TABLE
, but does not explain how to set it at creation time, especially when using the DataFrame API as above. Is there a way to do this?
Is there a better way to get CSV into a new table?
UPDATE:
Reading the docs here and here, and inspired by Robert's answer, I tried this first:
spark.conf.set("spark.databricks.delta.defaults.columnMapping.mode", "name")
Still no luck, I get the same error. It's interesting how hard it is for a beginner to write a CSV file with spaces in its headers to a Delta Live Table
Thanks to Hemant on the Databricks community forum, I have found the answer.
df1.write.format("delta").option("delta.columnMapping.mode", "name")
.option("path", "table_path").saveAsTable("new_table")
Now I can either query it with SQL or load it into a Spark dataframe:
SELECT * FROM new_table;
delta_df = spark.read.format("delta").load("table_path")
display(delta_df)
This method does the same thing but in SQL.
First, create a CSV-backed table for your CSV file:
CREATE TABLE table_csv
USING CSV
OPTIONS (path '/path/to/file.csv', 'header' 'true', 'mode' 'FAILFAST');
Then create a Delta table using the CSV-backed table:
CREATE TABLE delta_table
USING DELTA
TBLPROPERTIES ("delta.columnMapping.mode" = "name")
AS SELECT * FROM table_csv;
SELECT * FROM delta_table;
I've verified that I get the same error as I did when using Python should I omit the TBLPROPERTIES statement.
I guess the Python answer would be to use spark.sql
and run this using Python, that way I could embed the CSV path variable in the SQL.