apache-spark, pyspark, apache-iceberg

Partition bucket by year and month in PySpark


I have a dataframe like:

Cod   Date
1     2022-01-01
1     2022-01-10
1     2022-02-01
2     2022-03-01
2     2022-04-01
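
For reference, a minimal sketch of how this dataframe and the table_df view used below can be set up (it assumes a SparkSession already configured with an Iceberg catalog named local):

from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()  # Iceberg catalog "local" assumed configured

# Rebuild the sample data; Date is cast to a date type so the years()/months()
# partition transforms can later be applied to it.
df = spark.createDataFrame(
    [(1, "2022-01-01"), (1, "2022-01-10"), (1, "2022-02-01"),
     (2, "2022-03-01"), (2, "2022-04-01")],
    ["Cod", "Date"],
).withColumn("Date", f.to_date("Date"))

df.createOrReplaceTempView("table_df")  # the view the CTAS below selects from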

I'm trying to use Apache Iceberg to partition my dataframe by Cod/year/month using hidden partitioning.

spark.sql("CREATE TABLE local.table USING iceberg PARTITIONED BY (Cod, years(Date), months(Date)) as (SELECT * FROM table_df);")

I got the following error:

pyspark.sql.utils.IllegalArgumentException: Cannot add redundant partition: 1001: Date_year: year(2) conflicts with 1002: Date_month: month(2)

According to the Iceberg docs, this error is intentional; the suggested workaround is:

One workaround is to update the table with additional time based partitions after creating it with a single time based partition.

So I tried to update the partition spec instead (removing months(Date) from the CREATE TABLE query for this attempt):

spark.sql("ALTER TABLE local.table SET PARTITION SPEC (months(Date));")

This time I got the following error:

no viable alternative at input 'ALTER TABLE local.table SET PARTITION'(line 1, pos 28)

I also tried a PySpark approach, like:

table.writeTo("local.table").partitionedBy(f.col("Cod"), f.year(f.col("Date")), f.month(f.col("Date"))).create()

But I got the following error:

pyspark.sql.utils.AnalysisException: Invalid partition transformation: year(Date)
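
(In hindsight, part of the problem with this attempt is that f.year and f.month only extract values from a column; the partition transforms that partitionedBy() accepts are the plural years() and months() in pyspark.sql.functions. A sketch with the transform, leaving months() to be added later since the redundancy rule presumably applies here too:)

from pyspark.sql import functions as f

# years() is a partition transform usable with writeTo().partitionedBy();
# year() merely extracts an integer and is not a valid transform here.
table.writeTo("local.table").partitionedBy(f.col("Cod"), f.years(f.col("Date"))).create()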

What can I do to solve this problem?


Solution

  • After a lot of research, I found that Iceberg does not accept redundant time-based partitions in a single spec. The solution I found for partitioning by year and month was a two-step one:

    First, I created the table partitioned only by year:

    CREATE TABLE local.table (
        ...
    ) USING iceberg PARTITIONED BY (Cod, years(Date));
    

    Then I added the month partition field using ALTER TABLE:

    ALTER TABLE local.table ADD PARTITION FIELD months(Date)
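
    Putting the whole flow together in PySpark (a sketch; it assumes the dataframe is registered as table_df and that Iceberg's SQL extensions, org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions, are enabled, since ALTER TABLE ... ADD PARTITION FIELD requires them):

    # Step 1: create the table partitioned by Cod and year, loading the data.
    spark.sql("""
        CREATE TABLE local.table
        USING iceberg
        PARTITIONED BY (Cod, years(Date))
        AS SELECT * FROM table_df
    """)

    # Step 2: evolve the partition spec with the month transform.
    spark.sql("ALTER TABLE local.table ADD PARTITION FIELD months(Date)")

    Note that partition spec evolution is metadata-only: data already written keeps the old layout, and only rows written after the ALTER use the new year-plus-month partitioning.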