From the official docs: https://spark.apache.org/docs/latest/api/sql/index.html#array_size , `array_size` is available from Spark 3.3.0, but I need the same in Spark 3.2.0.
Is there some alternative for `array_size` that I can use while writing a SQL query for data residing in an Apache Iceberg table? (The SQL query is then run through Apache Spark 3.2.2.)
Here is some sample code to do that on a DataFrame, using the `size` function:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import size

spark = SparkSession.builder.appName("AlternativeArraySize").getOrCreate()

data = [(1, [10, 20, 30]), (2, [40, 50]), (3, [60])]
columns = ["id", "values"]
df = spark.createDataFrame(data, columns)

# size counts the elements of the array column directly
result = df.withColumn("array_size", size(df["values"]))
```

In this example, `size` returns the number of elements in each row's array. There is no need to `explode` the array into separate rows first; in fact, a generator function such as `explode` cannot be nested inside another expression like `size`.
In SQL, `size` can be applied to an array directly, e.g. `size(array('b', 'd', 'c', 'a'))` returns 4.
`size(map('a', 1, 'b', 2))` returns 2. For NULL input, the function returns -1 when `spark.sql.legacy.sizeOfNull` is set to true; it returns NULL when `spark.sql.legacy.sizeOfNull` is set to false or `spark.sql.ansi.enabled` is set to true.
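A quick way to check this behavior from a Spark SQL session (a sketch; the config keys are the ones named above, and defaults can differ between Spark versions):

```sql
SET spark.sql.legacy.sizeOfNull=true;
SELECT size(NULL);   -- -1 under legacy behavior

SET spark.sql.legacy.sizeOfNull=false;
SELECT size(NULL);   -- NULL
```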
In fact, there is not much difference between `array_size` and `size`: they differ in what they return for NULL input (`array_size` returns NULL, while `size` may return -1 depending on the settings above), and `array_size` only accepts arrays, whereas `size` also works on maps, as in the example above.
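Since the question is about a SQL query over an Iceberg table, the DataFrame example translates directly to SQL. A sketch, assuming a hypothetical table `my_catalog.db.events` with an array column `values`:

```sql
SELECT id, size(values) AS array_size
FROM my_catalog.db.events;
```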
Source: Documentation