Search code examples
apache-sparkpyspark

Employing Pyspark How to determine the frequency of each event and its event-by-event frequency


I have a dataset like:

Data

a
a
a
a
a
b
b
b
a
a
b

I would like to include a column that like the one below. The data will be in the form of a1,1 in the column, where the first element represents the event frequency (a1), or how often "a" appears in the field, and the second element (,1) is the frequency for each event, or how often "a" repeats before any other element (b) in the field. Can we carry this out with PySpark?

Data   Frequency
     a        a1,1
     a        a1,2    
     a        a1,3
     a        a1,4
     a        a1,5
     b        b1,1
     b        b1,2
     b        b1,3
     a        a2,1
     a        a2,2
     b        b2,1

Solution

  • You can achieve your desired result by doing this,

    from pyspark.sql import Window
    import pyspark.sql.functions as F
    
    df = spark.createDataFrame(['a', 'a', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'a'], 'string').toDF("Data")
    
    print("Original Data:")
    df.show()
    
    print("Result:")
    df.withColumn("ID", F.monotonically_increasing_id()) \
        .withColumn("group",
                F.row_number().over(Window.orderBy("ID"))
                - F.row_number().over(Window.partitionBy("Data").orderBy("ID"))
        ) \
        .withColumn("element_freq", F.when(F.col('Data') != 'abcd', F.row_number().over(Window.partitionBy("group").orderBy("ID"))).otherwise(F.lit(0)))\
        .withColumn("event_freq", F.when(F.col('Data') != 'abcd', F.dense_rank().over(Window.partitionBy("Data").orderBy("group"))).otherwise(F.lit(0)))\
        .withColumn("Frequency", F.concat_ws(',', F.concat(F.col("Data"), F.col("event_freq")), F.col("element_freq"))) \
        .orderBy("ID")\
        .drop("ID", "group", "event_freq", "element_freq")\
        .show()
    
    Original Data:
    +----+
    |Data|
    +----+
    |   a|
    |   a|
    |   a|
    |   a|
    |   a|
    |   b|
    |   b|
    |   b|
    |   a|
    |   a|
    |   b|
    +----+
    
    Result:
    +----+---------+
    |Data|Frequency|
    +----+---------+
    |   a|     a1,1|
    |   a|     a1,2|
    |   a|     a1,3|
    |   a|     a1,4|
    |   a|     a1,5|
    |   b|     b1,1|
    |   b|     b1,2|
    |   b|     b1,3|
    |   a|     a2,1|
    |   a|     a2,2|
    |   b|     b2,1|
    +----+---------+