I have a dataset like this:
Data
a
a
a
a
a
b
b
b
a
a
b
I would like to include a column like the one below. The values take the form a1,1, where the first part (a1) is the event number, i.e., which consecutive run of "a" this is in the field, and the second part (,1) is the frequency within that event, i.e., how many times "a" has repeated so far before any other element (b) interrupts. Can this be done with PySpark?
Data Frequency
a a1,1
a a1,2
a a1,3
a a1,4
a a1,5
b b1,1
b b1,2
b b1,3
a a2,1
a a2,2
b b2,1
You can achieve your desired result like this:
from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Same data as in the question: two runs of "a" and two runs of "b".
df = spark.createDataFrame(['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'b'], 'string').toDF("Data")

print("Original Data:")
df.show()

print("Result:")
df.withColumn("ID", F.monotonically_increasing_id()) \
    .withColumn("group",
                # A global row number minus a per-value row number is constant
                # within each consecutive run, so it labels each run ("island").
                F.row_number().over(Window.orderBy("ID"))
                - F.row_number().over(Window.partitionBy("Data").orderBy("ID"))) \
    .withColumn("element_freq",
                # Position within the current run; partitioning by both "Data"
                # and "group" keeps runs of different values from colliding.
                F.row_number().over(Window.partitionBy("Data", "group").orderBy("ID"))) \
    .withColumn("event_freq",
                # Which run of this value it is (1st run of "a", 2nd run, ...).
                F.dense_rank().over(Window.partitionBy("Data").orderBy("group"))) \
    .withColumn("Frequency", F.concat_ws(',', F.concat(F.col("Data"), F.col("event_freq")), F.col("element_freq"))) \
    .orderBy("ID") \
    .drop("ID", "group", "event_freq", "element_freq") \
    .show()
Original Data:
+----+
|Data|
+----+
| a|
| a|
| a|
| a|
| a|
| b|
| b|
| b|
| a|
| a|
| b|
+----+
Result:
+----+---------+
|Data|Frequency|
+----+---------+
| a| a1,1|
| a| a1,2|
| a| a1,3|
| a| a1,4|
| a| a1,5|
| b| b1,1|
| b| b1,2|
| b| b1,3|
| a| a2,1|
| a| a2,2|
| b| b2,1|
+----+---------+
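The key step is the classic gaps-and-islands trick: a global row_number minus a per-value row_number stays constant within each consecutive run, so the difference labels each run with a group id. dense_rank over those group ids, partitioned by value, then numbers the runs (the a1/a2 part), and row_number within each run gives the position (the ,1 part). If you want to watch the mechanics, here is a minimal sketch that reuses df, F, and Window from above and simply materializes the intermediate columns instead of dropping them (the rn_all and rn_per_value names are just for illustration):

# Show the intermediates: rn_all - rn_per_value is constant within each run.
df.withColumn("ID", F.monotonically_increasing_id()) \
    .withColumn("rn_all", F.row_number().over(Window.orderBy("ID"))) \
    .withColumn("rn_per_value", F.row_number().over(Window.partitionBy("Data").orderBy("ID"))) \
    .withColumn("group", F.col("rn_all") - F.col("rn_per_value")) \
    .orderBy("ID") \
    .show()

One caveat: monotonically_increasing_id only tracks the input order when the DataFrame already has a well-defined order, as it does here for a DataFrame built from a local list; for data read from distributed storage you would want an explicit ordering column instead.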