apache-spark, pyspark, dbt

Read JSON with dbt using Spark as the engine


I want to create a Lakehouse using dbt with Spark as its engine. As a first step I want to read some raw files, e.g. JSON files, and write them out as Delta or Iceberg tables. But it seems that dbt-spark does not support this. Did I miss something, or is this really not possible? If not, how can one ingest raw files and write them back as tables? I saw that dbt-duckdb supports this behaviour and it works, but unfortunately it does not support these external table formats. I want to avoid writing separate Spark jobs just to ingest the data first; I would like to do everything with dbt.


Solution

  • You're correct that dbt-spark currently doesn't directly support reading raw files and writing them to Delta or Iceberg tables within dbt models. However, you can try the approach below:

    1. Create standalone Spark jobs using PySpark or Scala to read the raw files and write them as Delta or Iceberg tables (see the sketch after this list).
    2. Schedule these jobs to run before your dbt models.
    3. Reference the generated tables in your dbt models for further transformations and analysis.
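
    For step 1, a minimal PySpark sketch could look like the following. The input path `/data/raw/events/` and table name `raw.events_delta` are placeholders, and it assumes Spark is configured with the Delta Lake extensions (e.g. the `delta-spark` package on the classpath):

    ```python
    from pyspark.sql import SparkSession

    # Placeholders throughout: adjust app name, paths, and table names to your setup.
    spark = (
        SparkSession.builder
        .appName("ingest_raw_json")
        # Delta Lake session extensions (assumes delta-spark is available)
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    # Read the raw JSON files; the schema is inferred here,
    # but you would normally supply an explicit schema in production.
    raw_df = spark.read.json("/data/raw/events/")

    # Write the data as a managed Delta table that dbt models can then reference.
    (
        raw_df.write
        .format("delta")          # or "iceberg", with the corresponding catalog config
        .mode("overwrite")
        .saveAsTable("raw.events_delta")
    )
    ```

    Once such a table exists, a dbt model can simply `select` from it (step 3), so all downstream transformations stay inside dbt.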

    You can also check the experimental dbt-external-tables package (https://github.com/dbt-labs/dbt-external-tables), which lets you register raw files as external table sources from within dbt.
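
    As a rough illustration (source name, location, and format below are assumptions, not from your project), with dbt-external-tables installed you describe the raw files as an external source in YAML and let the package create the external table before your models run:

    ```yaml
    # models/staging/sources.yml -- hypothetical source definition
    version: 2

    sources:
      - name: raw
        tables:
          - name: events
            external:
              location: '/data/raw/events/'   # placeholder path to the raw JSON files
              using: json                      # file format understood by the Spark plugin
    ```

    The external table is created by running `dbt run-operation stage_external_sources`, after which models can reference it via `{{ source('raw', 'events') }}`. Note that this only registers the raw files as an external table; it does not by itself materialize them as Delta or Iceberg, so you would still combine it with a dbt model (or a Spark job as above) to write those formats.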