Tags: arrays, pyspark, find-occurrences

Get the most common element of an array using Pyspark


How can I get the most common element of an array after concatenating two columns using PySpark?

from pyspark.sql import functions as F

df = spark.createDataFrame([
  [['a','a','b'],['a']],
  [['c','d','d'],['']],
  [['e'],['e','f']],
  [[''],['']]
]).toDF("arr_1","arr_2")

df_new = df.withColumn('arr', F.concat(F.col('arr_1'), F.col('arr_2')))

Expected output:

+------------------------+
| arr  | arr_1   | arr_2 |
+------------------------+
| [a]  | [a,a,b] | [a]   |
| [d]  | [c,d,d] | []    |
| [e]  | [e]     | [e,f] |
| []   | []      | []    | 
+------------------------+

Solution

  • Try this: explode the concatenated array, count each element per row, and keep the element(s) with the highest count:

    from pyspark.sql.functions import col, concat, desc, explode, monotonically_increasing_id, rank
    from pyspark.sql.window import Window

    # tag each row with an id and build the concatenated array
    df1 = df.select('arr_1', 'arr_2', monotonically_increasing_id().alias('id'),
                    concat('arr_1', 'arr_2').alias('arr'))

    # explode, count each element per row, rank by count within each row,
    # keep the most frequent element, and join back to the original columns
    df1.select('id', explode('arr')).\
       groupBy('id', 'col').count().\
       select('id', 'col', 'count',
              rank().over(Window.partitionBy('id').orderBy(desc('count'))).alias('rank')).\
       filter(col('rank') == 1).\
       join(df1, 'id').\
       select(col('col').alias('arr'), 'arr_1', 'arr_2').show()
    
    +---+---------+------+
    |arr|    arr_1| arr_2|
    +---+---------+------+
    |  a|[a, a, b]|   [a]|
    |   |       []|    []|
    |  e|      [e]|[e, f]|
    |  d|[c, d, d]|    []|
    +---+---------+------+
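
If you want the result back as an array column, like the expected output in the question, one option is a small variant of the same approach that collects the rank-1 element(s) of each row into an array with collect_list before joining. This is a sketch building on df1 and the imports above (the name top is just illustrative); ties and empty strings are kept as elements, exactly as in the answer above.

    from pyspark.sql.functions import collect_list

    # gather the top-ranked element(s) of each row into an array,
    # so the result has the same array shape as the expected output
    top = df1.select('id', explode('arr')).\
        groupBy('id', 'col').count().\
        withColumn('rank', rank().over(Window.partitionBy('id').orderBy(desc('count')))).\
        filter(col('rank') == 1).\
        groupBy('id').agg(collect_list('col').alias('arr'))

    top.join(df1.drop('arr'), 'id').select('arr', 'arr_1', 'arr_2').show()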