How can I get the most common element of an array after concatenating two array columns in PySpark?
from pyspark.sql import functions as F

df = spark.createDataFrame([
[['a','a','b'],['a']],
[['c','d','d'],['']],
[['e'],['e','f']],
[[''],['']]
]).toDF("arr_1","arr_2")
df_new = df.withColumn('arr',F.concat(F.col('arr_1'),F.col('arr_2')))
Expected output:
+-----+-----------+--------+
| arr | arr_1     | arr_2  |
+-----+-----------+--------+
| [a] | [a,a,b]   | [a]    |
| [d] | [c,d,d]   | []     |
| [e] | [e]       | [e,f]  |
| []  | []        | []     |
+-----+-----------+--------+
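Try this: give each row an id, explode the concatenated array, count each element per id, and keep the element with the highest count.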
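from pyspark.sql import Window
from pyspark.sql.functions import col, concat, desc, explode, monotonically_increasing_id, rank

# Tag each row with an id so the exploded elements can be grouped back per row
df1 = df.select('arr_1','arr_2',monotonically_increasing_id().alias('id'),concat('arr_1','arr_2').alias('arr'))
# Count each element per id and keep rank 1; note that rank() keeps every element tied for first place
df1.select('id',explode('arr')).\
groupBy('id','col').count().\
select('id','col','count',rank().over(Window.partitionBy('id').orderBy(desc('count'))).alias('rank')).\
filter(col('rank')==1).\
join(df1,'id').\
select(col('col').alias('arr'), 'arr_1', 'arr_2').show()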
+---+---------+------+
|arr| arr_1| arr_2|
+---+---------+------+
| a|[a, a, b]| [a]|
| | []| []|
| e| [e]|[e, f]|
| d|[c, d, d]| []|
+---+---------+------+
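If you want to sanity-check the per-row logic without a Spark session, here is a minimal plain-Python sketch of the same idea (concatenate, count, take the most frequent). The helper name `most_common_element` is hypothetical and not part of PySpark. Unlike the `rank()` version above, it returns a single element per row, and on a tie it picks the element that appears first in the concatenated array.

```python
from collections import Counter

def most_common_element(arr_1, arr_2):
    """Return the most frequent element of arr_1 + arr_2 (hypothetical helper).

    Returns None when both arrays are empty.
    """
    combined = list(arr_1) + list(arr_2)
    if not combined:
        return None
    # most_common(1) returns [(element, count)]; elements with equal
    # counts are ordered by first occurrence, so ties go to the earliest one
    return Counter(combined).most_common(1)[0][0]

# Same rows as the question's DataFrame
rows = [
    (['a', 'a', 'b'], ['a']),
    (['c', 'd', 'd'], ['']),
    (['e'], ['e', 'f']),
    ([''], ['']),
]
results = [most_common_element(a, b) for a, b in rows]
print(results)  # ['a', 'd', 'e', '']
```

A function like this could in principle be wrapped with `pyspark.sql.functions.udf` and applied to `arr_1` and `arr_2`, but a Python UDF is usually slower than the built-in explode/groupBy approach, so treat it as a way to verify results on small data.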