Tags: pyspark, databricks, databricks-autoloader

Databricks Auto Loader: use MAP() type as a schema hint


I am attempting to set up a readStream using Auto Loader in PySpark on Databricks:

spark.readStream.format("cloudFiles") \
  .option("cloudFiles.format", "csv") \
  .option("inferSchema", True) \
  .option("cloudFiles.schemaLocation", schema_path) \
  .option("cloudFiles.schemaHints", "col1 string, col2 timestamp, col3 timestamp, col4 timestamp, col5 timestamp, col6 int, col7 MAP<STRING,STRING>, col8 MAP<STRING,STRING>, col9 MAP<STRING,STRING>, col10 MAP<STRING,STRING>, col11 MAP<STRING,STRING>, col12 MAP<STRING,STRING>, col13 MAP<STRING,STRING>") \
  .option("cloudFiles.schemaEvolutionMode", "rescue") \
  .load(raw_path_df) \
  .writeStream \
  .option("checkpointLocation", checkpoint_path) \
  .trigger(once=True) \
  .toTable(bronze_tbl)

However, I keep getting java.lang.Exception: Unsupported type: map<string,string>

I'm not sure why this is happening. I have used Auto Loader to read in data countless times before, and have used the MAP() type as a schema hint. What am I missing here?

The readStream above works as soon as I remove the schema hints option.


Solution

  • This happens because CSV by definition doesn't support complex types, only scalar values such as strings and numbers. Otherwise, what data representation should be used for the type? JSON-encoded, or something custom?

    If your data is encoded as JSON, you simply need to declare those columns as strings in the hints and apply from_json to them afterwards, as sketched below.
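
    A minimal sketch of that approach, assuming the map columns arrive as JSON-encoded strings such as {"k1":"v1"} (the names schema_path, raw_path_df, checkpoint_path, and bronze_tbl are carried over from the question; the key/value types are assumed to be strings):

    from pyspark.sql.functions import from_json
    from pyspark.sql.types import MapType, StringType

    # Columns that hold JSON-encoded maps in the CSV files.
    map_cols = ["col7", "col8", "col9", "col10", "col11", "col12", "col13"]

    # Hint the map columns as plain strings so the CSV reader accepts them...
    df = (spark.readStream.format("cloudFiles")
          .option("cloudFiles.format", "csv")
          .option("cloudFiles.schemaLocation", schema_path)
          .option("cloudFiles.schemaHints",
                  "col1 string, col2 timestamp, col3 timestamp, col4 timestamp, "
                  "col5 timestamp, col6 int, "
                  + ", ".join(f"{c} string" for c in map_cols))
          .option("cloudFiles.schemaEvolutionMode", "rescue")
          .load(raw_path_df))

    # ...then parse each JSON string into a real MAP<STRING,STRING> column.
    for c in map_cols:
        df = df.withColumn(c, from_json(df[c], MapType(StringType(), StringType())))

    (df.writeStream
       .option("checkpointLocation", checkpoint_path)
       .trigger(once=True)
       .toTable(bronze_tbl))

    The resulting bronze table then carries real map columns, while the CSV files themselves only ever contain plain strings.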