Tags: pyspark, databricks, delta-live-tables

Databricks: Ingesting CSV data to a Delta Live Table in Python triggers "invalid characters in table name" error - how to set column mapping mode?


First off, can I just say that I am learning Databricks at the time of writing this post, so I'd like simpler, cruder solutions as well as more sophisticated ones.

I am reading a CSV file like this:

df1 = spark.read.format("csv").option("header", True).load(path_to_csv_file)

Then I'm saving it as a Delta Live Table like this:

df1.write.format("delta").save("table_path")

The CSV headers contain characters such as spaces, &, and /, and I get this error:

AnalysisException: Found invalid character(s) among " ,;{}()\n\t=" in the column names of your schema. Please enable column mapping by setting table property 'delta.columnMapping.mode' to 'name'. For more details, refer to https://docs.databricks.com/delta/delta-column-mapping.html Or you can use alias to rename it.

The documentation I've seen on the issue explains how to set the column mapping mode to 'name' AFTER a table has been created using ALTER TABLE, but does not explain how to set it at creation time, especially when using the DataFrame API as above. Is there a way to do this?
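
For reference, the ALTER TABLE route those docs describe looks roughly like this when driven from Python. existing_table is a placeholder name, and per the docs the table's protocol versions have to be raised along with the mapping mode:

spark.sql("""
    ALTER TABLE existing_table SET TBLPROPERTIES (
        'delta.minReaderVersion' = '2',
        'delta.minWriterVersion' = '5',
        'delta.columnMapping.mode' = 'name'
    )
""")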

Is there a better way to get CSV into a new table?
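
As an aside, the error message itself suggests aliasing the columns as a workaround. A minimal sketch of that, with the substitution pattern built from the characters the error lists plus & and /:

import re

# Replace every character Delta rejects in the headers with an underscore,
# then write as before. The underscore replacement is just an example.
clean_df = df1.toDF(*[re.sub(r"[ ,;{}()\n\t=&/]", "_", c) for c in df1.columns])
clean_df.write.format("delta").save("table_path")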


UPDATE:

Reading the docs here and here, and inspired by Robert's answer, I tried this first:

spark.conf.set("spark.databricks.delta.defaults.columnMapping.mode", "name")

Still no luck; I get the same error. It's interesting how hard it is for a beginner to write a CSV file with spaces in its headers to a Delta Live Table.


Solution

  • Thanks to Hemant on the Databricks community forum, I have found the answer.

    df1.write.format("delta").option("delta.columnMapping.mode", "name")
                             .option("path", "table_path").saveAsTable("new_table")
    

    Now I can either query it with SQL or load it into a Spark DataFrame:

    SELECT * FROM new_table;
    
    delta_df = spark.read.format("delta").load("table_path")
    display(delta_df)
    

    SQL Way

    This method does the same thing but in SQL.

    First, create a CSV-backed table for your CSV file:

    CREATE TABLE table_csv
      USING CSV
      OPTIONS (path '/path/to/file.csv', header 'true', mode 'FAILFAST');
    

    Then create a Delta table using the CSV-backed table:

    CREATE TABLE delta_table
      USING DELTA
      TBLPROPERTIES ("delta.columnMapping.mode" = "name")
      AS SELECT * FROM table_csv;
    
    SELECT * FROM delta_table;
    

    I've verified that if I omit the TBLPROPERTIES clause, I get the same error as I did when using Python.

    I guess the Python answer would be to run this same SQL through spark.sql; that way I could embed the CSV path variable in the SQL, as sketched below.
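
    A sketch of that idea (the path and table names are placeholders):

    csv_path = "/path/to/file.csv"

    # Create the CSV-backed table, interpolating the path into the SQL.
    spark.sql(f"""
        CREATE TABLE table_csv
        USING CSV
        OPTIONS (path '{csv_path}', header 'true', mode 'FAILFAST')
    """)

    # Create the Delta table with column mapping enabled up front.
    spark.sql("""
        CREATE TABLE delta_table
        USING DELTA
        TBLPROPERTIES ('delta.columnMapping.mode' = 'name')
        AS SELECT * FROM table_csv
    """)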