How to calculate average by category in PySpark Streaming?

I have CSV data arriving as DStreams from traffic counters. A sample is as follows:


I want to calculate average speed (for each location) by vehicle category.

I want to achieve this using transformations. Below is the result I am looking for.

Location |  Car | MBike
Tracker 1| 73.5 |  73
Tracker 2| 51.5 |  52.5


  • I'm not sure exactly what you want, but if it's the average speed by vehicle and by location, then you can use a Window function:

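    from pyspark.sql import Window
    import pyspark.sql.functions as F
    # df = spark.createDataFrame(...) with columns Location, Vehicle, Speed (sample rows)
    # Window over each (Location, Vehicle) pair
    w = Window.partitionBy("Location","Vehicle")
    # Attach the average to every row, then pivot the vehicle types into columns
    df_pivot = df\
                .withColumn('avg_speed', F.avg(F.col('Speed')).over(w))\
                .groupby('Location','Vehicle', 'avg_speed')\
                .pivot('Vehicle')\
                .agg(F.first('avg_speed'))\
                .drop('Vehicle', 'avg_speed')
    # Collapse the per-vehicle rows into one row per location (nulls are ignored by sum)
    expr = {x: "sum" for x in df_pivot.columns if x != df_pivot.columns[0]}
    df_almost_final = df_pivot\
                .groupBy('Location')\
                .agg(expr)\
                .orderBy('Location')
    # Strip the "sum(...)" wrapper from the aggregated column names
    df_final = df_almost_final.select([F.col(c).alias(c.replace('sum(','').replace(')','')) for c in df_almost_final.columns])
    df_final.show()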
    # +--------+-----+----+
    # |Location|mbike| car|
    # +--------+-----+----+
    # |tracker1| 73.0|73.5|
    # |tracker2| 52.5|51.5|
    # +--------+-----+----+
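
Since the question asks for DStream transformations, a minimal sketch of the same average using a (sum, count) pair per key may help. It assumes each CSV line has the form Location,Vehicle,Speed, which the question does not confirm. The names to_pair, add_pairs, to_avg and local_average, the DStream name `lines`, and the sample rows are all hypothetical. The local_average helper only imitates map, reduceByKey and mapValues in plain Python so the logic can be checked without a cluster.

```python
def to_pair(line):
    """Parse 'Location,Vehicle,Speed' (assumed column order) into ((location, vehicle), (speed, 1))."""
    location, vehicle, speed = [field.strip() for field in line.split(",")]
    return (location, vehicle), (float(speed), 1)


def add_pairs(a, b):
    """Combine two (sum, count) pairs; associative, so it is safe for reduceByKey."""
    return a[0] + b[0], a[1] + b[1]


def to_avg(total_count):
    """Turn a (sum, count) pair into an average."""
    total, count = total_count
    return total / count


# With a DStream of CSV lines called `lines` (hypothetical name), the same
# functions would chain as:
#   averages = lines.map(to_pair).reduceByKey(add_pairs).mapValues(to_avg)


def local_average(lines):
    """Plain-Python stand-in for map -> reduceByKey -> mapValues."""
    sums = {}
    for key, value in map(to_pair, lines):
        sums[key] = add_pairs(sums[key], value) if key in sums else value
    return {key: to_avg(value) for key, value in sums.items()}


# Hypothetical rows, not the asker's data
sample = [
    "Tracker 1,Car,60",
    "Tracker 1,Car,70",
    "Tracker 1,MBike,50",
    "Tracker 2,Car,40",
]
result = local_average(sample)
print(result)
```

On a real stream this would produce one set of averages per batch, keyed by (location, vehicle); a further step would still be needed to lay the vehicle types out as columns like the table in the question.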