The input dataframe is shown below:
+---+---+---+---+
|  A|  B|  C|  D|
+---+---+---+---+
| 10| 20| 11| 20|
| 20| 11| 10| 99|
| 10| 11| 20|  1|
| 30| 12| 20| 99|
| 10| 11| 20| 20|
| 40| 13| 15|  3|
| 30|  8| 11| 99|
+---+---+---+---+
I am using a join but not getting the desired result. How can I generate an output dataframe like the one below in PySpark?
+---+-------+---+-------+---+-------+---+-------+
|  A|A_Count|  B|B_Count|  C|C_Count|  D|D_Count|
+---+-------+---+-------+---+-------+---+-------+
| 10|      3|  8|      1| 10|      1|  1|      1|
| 20|      1| 11|      3| 11|      2|  3|      1|
| 30|      2| 12|      1| 15|      1| 20|      2|
| 40|      1| 13|      1| 20|      3| 99|      3|
+---+-------+---+-------+---+-------+---+-------+
The code below works for me.
from functools import reduce
from pyspark.sql import Window
from pyspark.sql.functions import col, count, monotonically_increasing_id, row_number
from pyspark.sql.types import IntegerType

# Window over the incoming row order; used to assign a join key per rank.
w0 = Window.orderBy(monotonically_increasing_id())

list_of_sdf = []
for coll in ab.columns:
    ab = ab.withColumn(coll, col(coll).cast(IntegerType()))
    # Count occurrences of each distinct value in the column.
    ab1 = ab.groupBy(col(coll)).agg(count(col(coll)).alias(f"{coll}_Count"))
    # Sort by value and attach a row number so the per-column counts can be joined side by side.
    list_of_sdf.append(
        ab1.select([coll, f"{coll}_Count"]).sort([coll]).withColumn("row", row_number().over(w0))
    )

def join_dfs(df1, df2, join_column="row"):
    return df1.join(df2, on=join_column, how="inner")

# Perform successive joins using reduce, then drop the helper key.
if list_of_sdf:
    joined_df = reduce(join_dfs, list_of_sdf).drop("row")
else:
    print("The list of DataFrames is empty.")
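As a Spark-free sanity check of the same logic (per-column value counts, sorted by value, then zipped row-wise), here is a plain-Python sketch using the sample data from the question. The names `rows`, `columns`, `per_column`, and `zipped` are illustrative, not part of the Spark code. Note that `zip()` truncates to the shortest list, just as the inner join on "row" does: column B has a fifth distinct value (20), which is dropped from the result, matching the four-row output above.

```python
from collections import Counter

# Sample data from the question, as plain tuples.
rows = [
    (10, 20, 11, 20),
    (20, 11, 10, 99),
    (10, 11, 20, 1),
    (30, 12, 20, 99),
    (10, 11, 20, 20),
    (40, 13, 15, 3),
    (30, 8, 11, 99),
]
columns = ["A", "B", "C", "D"]

# Per column: count each distinct value, then sort by value.
per_column = [
    sorted(Counter(r[i] for r in rows).items())
    for i in range(len(columns))
]

# Zip the sorted (value, count) pairs row-wise; like the inner join on
# "row", zip() stops at the shortest column, so B's (20, 1) is dropped.
zipped = list(zip(*per_column))
for tup in zipped:
    print(tup)
```

Each printed tuple corresponds to one row of the desired output: pairs of (value, count) for A, B, C, and D in order.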