Tags: databricks, databricks-autoloader

Can Databricks Auto Loader infer partitions?


By default, when your files follow a Hive-style partition directory structure, Auto Loader's schema inference detects the partition columns and adds them to your schema automatically; the cloudFiles.partitionColumns option lets you name them explicitly if needed.

This is the code:

checkpoint_path = "s3://dev-bucket/_checkpoint/dev_table"
(
    spark
    .readStream
    .format("cloudFiles")                                  # Auto Loader source
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", checkpoint_path)  # where the inferred schema is tracked
    .load("s3://autoloader-source/json-data")
    .writeStream
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)                            # process all pending files, then stop
    .toTable("dev_catalog.dev_database.dev_table")
)
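
For reference, if inference does not pick the partition columns up on its own, they can also be declared explicitly. A minimal sketch of just the read side, assuming a /year=.../month=.../day=... layout (the column names are illustrative):

df = (
    spark
    .readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", checkpoint_path)
    # Explicitly name the Hive-style partition columns to parse
    # from the directory structure (assumed layout, not from the
    # original question's code)
    .option("cloudFiles.partitionColumns", "year,month,day")
    .load("s3://autoloader-source/json-data")
)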

But can we have an option to also partition the target table, like you can with a plain CREATE TABLE? (E.g. if you have the classical structure /year=xxxx/month=xxx/day=xx.)


Solution

  • You can use the .partitionBy() method of the stream writer:

    checkpoint_path = "s3://dev-bucket/_checkpoint/dev_table"
    (
        spark
        .readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", checkpoint_path)
        .load("s3://autoloader-source/json-data")
        .writeStream
        .option("checkpointLocation", checkpoint_path)
        .partitionBy("col1", "col2")  # partition the target table by these columns
        .trigger(availableNow=True)
        .toTable("dev_catalog.dev_database.dev_table")
    )
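
  • To confirm that the target Delta table picked up the partitioning, you can inspect its metadata after the first run. A quick check, assuming the same table name as above (DESCRIBE DETAIL is standard Delta Lake SQL):

    # Show the partition columns recorded in the Delta table's metadata
    (
        spark.sql("DESCRIBE DETAIL dev_catalog.dev_database.dev_table")
        .select("partitionColumns")
        .show(truncate=False)
    )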