Tags: python, matplotlib, seaborn, data-science

In a scatterplot, how do I plot a line that is the average of the y-coordinates of all data points that share the same x-coordinate?


I want something like the plots shown in the figure below, where the blue line is the average line, generated by plotting the mean of the y-coordinates of all data points that share the same x-coordinate.

Fig-1

I tried the code below:

window_size = 10
df_avg = pd.DataFrame(columns=df.columns)

for col in df.columns:
    df_avg[col] = df[col].rolling(window=window_size).mean()

plt.figure(figsize=(20,20))
for idx, col in enumerate(df.columns, 1):
    plt.subplot(df.shape[1]-4, 4, idx)
    sns.scatterplot(data=df, x=col, y='charges')
    plt.plot(df_avg[col],df['charges'])
    plt.xlabel(col)

This produced the plots shown below, which are obviously not what I wanted.

Fig-2


Solution

  • If you're looking for a way to do it with just matplotlib and NumPy, here is one possible direction:

    import matplotlib.pyplot as plt
    import numpy as np
    
    # Create a toy dataset of 500 (x, y) points with integer coordinates in [0, 50)
    N_points = 500
    rand_pts = np.random.choice(50, size=(N_points, 2))
    
    # Map each unique x value to the array of y values that share it.
    # np.unique returns the x values sorted, so the line is drawn left to right.
    rand_dict = {uni: rand_pts[rand_pts[:, 0] == uni, 1] for uni in np.unique(rand_pts[:, 0])}
    
    # Plot the scatter, then the mean y value for each unique x
    plt.scatter(rand_pts[:, 0], rand_pts[:, 1], s=50)
    plt.plot(list(rand_dict.keys()), [np.mean(val) for val in rand_dict.values()], color='tab:orange', lw=4)
    

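  • If your data is already in a pandas DataFrame, as in the question, a groupby may be a shorter route. The rolling mean in the question averages consecutive rows of each column, so it never groups points by their x value; grouping by the x column and taking the mean of 'charges' does. The sketch below is only an illustration: the DataFrame is synthetic, and the column name 'children' and the helper mean_line are assumptions made for the example (only 'charges' comes from the question).

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the question's DataFrame (assumption):
# 'children' is a hypothetical discrete x column, 'charges' is the y column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "children": rng.integers(0, 6, size=500),
    "charges": rng.normal(13000, 4000, size=500),
})

def mean_line(data, x, y):
    """Return a Series holding the mean of y for each unique value of x.

    groupby sorts the group keys by default, so the index is in ascending x order.
    """
    return data.groupby(x)[y].mean()

avg = mean_line(df, "children", "charges")

fig, ax = plt.subplots()
ax.scatter(df["children"], df["charges"], s=20, alpha=0.5)   # raw points
ax.plot(avg.index, avg.values, color="tab:orange", lw=3)     # mean y per unique x
ax.set_xlabel("children")
ax.set_ylabel("charges")
```

    Inside the question's loop, the same idea would replace the plt.plot(df_avg[col], df['charges']) call with a plot of df.groupby(col)['charges'].mean(). For continuous columns with few repeated x values, the per-x mean will be noisy, so binning the x values first may be worth considering.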