
Not able to get Array size in Apache Iceberg with Spark 3.2.0 or before


From the official doc: https://spark.apache.org/docs/latest/api/sql/index.html#array_size , `array_size` is present from Spark 3.3.0, but I need the same functionality in Spark 3.2.0.

Is there an alternative to array_size that I can use when writing a SQL query against data residing in an Apache Iceberg table? (The SQL query is then run through Apache Spark 3.2.2.)


Solution

  • Here is some sample code to do that on a DataFrame, using the size function:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import size
    
    spark = SparkSession.builder.appName("AlternativeArray_Size").getOrCreate()
    
    data = [(1, [10, 20, 30]), (2, [40, 50]), (3, [60])]
    columns = ["id", "values"]
    df = spark.createDataFrame(data, columns)
    
    # size() returns the number of elements in the array of each row
    result = df.withColumn("array_size", size(df["values"]))
    

    In this example, the size function counts the elements of the array in each row directly; there is no need to explode the array into separate rows first. (Note that size(explode(...)) would not work, because explode is a generator function and cannot be nested inside another function.)

    The same function is available directly in SQL: size(array('b', 'd', 'c', 'a')) returns 4.

    size(map('a', 1, 'b', 2)) returns 2. If the expression is NULL, the function returns -1 instead of NULL. The function returns NULL for NULL input if spark.sql.legacy.sizeOfNull is set to false or spark.sql.ansi.enabled is set to true.
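    To illustrate the SQL form, here is a minimal sketch running size through spark.sql. An ordinary temp view named demo stands in for an Iceberg table (the view name and column names are made up for this example); the same query would work against an Iceberg table in Spark 3.2:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("SizeInSQL").getOrCreate()

    # Hypothetical stand-in for an Iceberg table: a temp view with an
    # array column and a map column
    spark.createDataFrame(
        [(1, [10, 20, 30], {"a": 1}), (2, [40, 50], {"a": 1, "b": 2})],
        ["id", "items", "props"],
    ).createOrReplaceTempView("demo")

    # size works in Spark 3.2 SQL for both arrays and maps
    result = spark.sql(
        "SELECT id, size(items) AS n_items, size(props) AS n_props FROM demo"
    )
    result.show()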

    In actual fact, there is not much difference between array_size and size: the values they return for NULL input differ, and array_size only accepts arrays, whereas size can also be used on maps, as in the example above.
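    The NULL behaviour described above can be checked directly by toggling the spark.sql.legacy.sizeOfNull config at runtime; a small sketch (assuming spark.sql.ansi.enabled is left at its default of false):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("SizeOfNull").getOrCreate()

    # Default in Spark 3.x: sizeOfNull legacy mode is off, so size(NULL) is NULL
    spark.conf.set("spark.sql.legacy.sizeOfNull", "false")
    modern = spark.sql("SELECT size(CAST(NULL AS ARRAY<INT>)) AS s").first()["s"]

    # With the legacy flag on, size(NULL) returns -1 instead of NULL
    spark.conf.set("spark.sql.legacy.sizeOfNull", "true")
    legacy = spark.sql("SELECT size(CAST(NULL AS ARRAY<INT>)) AS s").first()["s"]

    print(modern, legacy)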

    Source: Documentation