Search code examples
matplotlibpysparkscatter-plot

Naming Data Points with labels using scatter plot pyspark


I have the below dataframe

----------------------------------------
|date       |student_name| count | cluster|
|------------|---------- |-------|--------|
|234454333333|A          |50     |2       |
|345000004000|B          |100    | 4      |
|345000004050|C          |95     | 4      |
------------------------------------------

Using this dataframe I am drawing a scatter plot as follows

c1 = data_pd[data_pd.cluster == 0]
c2 = data_pd[data_pd.cluster == 1]
c3 = data_pd[data_pd.cluster == 2]
c4 = data_pd[data_pd.cluster == 3]
c5 = data_pd[data_pd.cluster == 4]
plt.scatter(c1.date, c1['count'],color='green')
plt.scatter(c2.date, c2['count'],color='blue')
plt.scatter(c3.date, c3['count'],color='red')
plt.scatter(c4.date, c4['count'],color='pink')
plt.scatter(c5.date, c5['count'],color='yellow')
plt.xlabel('date')
plt.ylabel('count')

I want to name each data point with the respective student_name value from the dataframe. How can I achieve this using pyspark?


Solution

  • I generated a small example just to give you an idea how this could be achieved using annotate.

    import random
    from pyspark.sql import SparkSession
    import matplotlib.pyplot as plt
    
    
    def plot_cluster(cluster, color, data_pd):
        data = data_pd[data_pd.cluster == cluster]
        plt.scatter(data.date, data["count"], color=color)
        for i, label in enumerate(data["student_name"]):
            plt.annotate(label, (data.date.iloc[i], data["count"].iloc[i]))
    
    
    if __name__ == "__main__":
        spark = SparkSession.builder.getOrCreate()
        data = [
            {
                "date": 234454333333 + random.randrange(50000),
                "student_name": random.choice(["A", "B", "C"]),
                "count": random.randrange(20, 100),
                "cluster": random.randrange(5),
            }
            for _ in range(100)
        ]
        df = spark.createDataFrame(data)
        data_pd = df.toPandas()
        clusters = [0, 1, 2, 3, 4]
        colors = ["green", "blue", "red", "pink", "yellow"]
        for cluster, color in zip(clusters, colors):
            plot_cluster(cluster, color, data_pd)
        plt.xlabel("date")
        plt.ylabel("count")
        plt.show()
    

    X axis should obviously be taken care of but it doesn't matter here

    Figure:

    enter image description here