I need to write a data transformation that is optimized for later reads. I planned to do this in PySpark with:
df.repartitionByRange(max_partitions, ..., rand())
    .write
    .bucketBy(numBuckets, ...)
    .sortBy(...)
    .option("maxRecordsPerFile", 1000000)
    .saveAsTable(...)
As this is just a transformation, I thought it could be a good use case for me to try dbt.
I have never used dbt, so my question is: would I be able to achieve the same with dbt over Spark if I am not the admin of the dbt instance and can only write queries on top of the Spark connector?
Thanks
The dbt-spark adapter currently supports partition_by, clustered_by, and buckets in the model config, which correspond to the same options offered in Spark SQL's CREATE TABLE statement.
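For illustration, a model using those config keys might look like the sketch below. This is a minimal example under stated assumptions: the model name, column names, and the upstream ref are hypothetical, and the exact config keys can vary between dbt-spark versions.

```sql
-- models/events_bucketed.sql  (hypothetical model name and columns)
{{
    config(
        materialized='table',
        file_format='parquet',
        partition_by='event_date',   -- maps to PARTITIONED BY
        clustered_by='user_id',      -- maps to CLUSTERED BY
        buckets=16                   -- maps to INTO 16 BUCKETS
    )
}}

select
    event_date,
    user_id,
    payload
from {{ ref('raw_events') }}  -- hypothetical upstream model
```

Settings that live on the Spark session rather than in the CREATE TABLE statement, such as the per-file record cap (spark.sql.files.maxRecordsPerFile), are not part of the model config, so if you can only write models, the dbt admin would likely need to set those on the connection profile or cluster.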