apache-spark · pyspark · databricks · azure-databricks · databricks-autoloader

Ingest CSV data with Auto Loader using a specific delimiter / separator


I'm trying to load several CSV files that use a complex separator ("~|~").

The current code loads the CSV files, but it doesn't identify the correct columns because it falls back to the default separator (",").

I'm reading the documentation here https://docs.databricks.com/spark/latest/structured-streaming/auto-loader-csv.html, but it doesn't say anything about it, or at least I'm not able to see it:

# The schema location directory keeps track of your data schema over time
spark.readStream.format("cloudFiles") \
  .option("cloudFiles.format", "csv") \
  .option("cloudFiles.schemaLocation", "<path-to-checkpoint>") \
  .load("<path-to-source-data>") \
  .writeStream \
  .option("mergeSchema", "true") \
  .option("checkpointLocation", "<path-to-checkpoint>") \
  .start("<path-to-target>")

Solution

  • Documentation says:

    With Auto Loader you can ingest JSON, CSV, PARQUET, AVRO, TEXT, BINARYFILE, and ORC files. See Format options for the options for these file formats.

    So you can just use standard options for CSV files - you need the delimiter (or sep) option:

    df = spark.readStream.format("cloudFiles") \
      .option("cloudFiles.format", "csv") \
      .option("delimiter", "~|~") \
      .schema(...) \
      .load(...)
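
    Putting it together with the streaming write from the question, a minimal sketch might look like the following. The <...> paths are placeholders to replace with your own locations, and note that multi-character delimiters such as "~|~" require Spark 3.0 or later:

    ```python
    # Sketch only: read with the custom delimiter, infer/track the schema via
    # schemaLocation, and write out with the original writeStream options.
    # All <...> paths are placeholders.
    spark.readStream.format("cloudFiles") \
      .option("cloudFiles.format", "csv") \
      .option("cloudFiles.schemaLocation", "<path-to-checkpoint>") \
      .option("delimiter", "~|~") \
      .load("<path-to-source-data>") \
      .writeStream \
      .option("mergeSchema", "true") \
      .option("checkpointLocation", "<path-to-checkpoint>") \
      .start("<path-to-target>")
    ```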