Tags: python, matplotlib, histogram, scatter

How to draw a scatter plot that also shows the histogram of y values for each value of x


I have a set of X and Y data points (about 20k) that I would like to represent using a scatter plot.

The data set looks something like this:

x = [1, 1, 2, 1, 2, 1, 1, 2]

y = [3.1, 3.1, 3.1, 1, 2, 3.1, 1, 2]

(Not all values in the actual data set are integers.)

I would like to make a scatter plot in which the color of each point indicates how often that particular 'y' value occurs for that particular 'x'.

For this I tried to calculate the histogram of y for each x value, but I always end up with a plot that is wrong. The code I use is shown below:

    import numpy as np
    import matplotlib.pyplot as plt

    x = [1, 1, 2, 1, 2, 1, 1, 2]
    y = [3.1, 3.1, 3.1, 1, 2, 3.1, 1, 2]

    I = []
    Y = []
    C = []

    for i in range(0, len(x)):
        if x[i] not in I:
            I.append(x[i])
            for j in range(0, len(x)):
                if x[i] == x[j]:
                    Y.append(y[j])
                    u, c = np.unique(Y, return_counts=True)
                    C.append(c)
                    Y = []

    plt.scatter(x, y, s=70, c=C, cmap='RdYlBu', marker='o', edgecolors='black', linewidth=1, alpha=0.7)

    plt.xlabel('x')
    plt.ylabel('y')
    plt.colorbar()

The final plot looks like this: [image: final plot]

It would be really helpful if someone could tell me where I'm making a mistake, or how I could achieve this. I'm very new to Python, so more explanation is appreciated.

Thank you in advance. (Also, would it be possible to make dots that have the same value appear repeatedly with the same color?)


Solution

  • Here is code that works for you:

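    import numpy as np
    import matplotlib.pyplot as plt

    x = np.array([1, 1, 2, 1, 2, 1, 1, 2])
    y = np.array([3.1, 3.1, 3.1, 1, 2, 3.1, 1, 2])

    # One entry per distinct (x, y) pair: its x, its y, and how often it occurs.
    X = []
    Y = []
    C = []

    for i in np.unique(x):
        # All y values recorded for this particular x.
        new_y = y[np.where(x == i)]
        # Distinct y values for this x and the number of times each occurs.
        unique, count = np.unique(new_y, return_counts=True)

        for j in range(len(unique)):
            X.append(i)
            Y.append(unique[j])
            C.append(count[j])

    # X, Y and C have the same length, so each point gets exactly one color.
    plt.scatter(X, Y, c=C)
    plt.colorbar()
    plt.show()
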

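    For each unique value of x, I select the matching y values with the built-in NumPy function where, then count how often each distinct y occurs with np.unique, much as you did. The difference is that X, Y and C are built together, so every plotted point has exactly one color value.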

    Here is the result:

    [image: result]
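
As a possible variation on the answer above (a sketch, not part of the original answer): the question also asks whether repeated dots can keep the same color. One way is to compute a count for every original point instead of one per distinct pair. The helper names `pair_counts` and `plot_pair_counts` below are made up for illustration, and the sketch assumes that "same value" means exact equality of the (x, y) pair.

```python
import numpy as np


def pair_counts(x, y):
    """Return, for every point i, how often the pair (x[i], y[i]) occurs."""
    pairs = np.column_stack((x, y))  # shape (n, 2), one row per point
    # Unique rows, the index of each point's unique row, and each row's count.
    _, inverse, counts = np.unique(
        pairs, axis=0, return_inverse=True, return_counts=True
    )
    # ravel() guards against NumPy versions that return a 2-D inverse here.
    return counts[inverse.ravel()]


def plot_pair_counts(x, y):
    """Scatter every original point, colored by the frequency of its pair."""
    # Imported here so that pair_counts can be used without matplotlib.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    points = ax.scatter(x, y, c=pair_counts(x, y), cmap="RdYlBu",
                        s=70, edgecolors="black")
    fig.colorbar(points, ax=ax, label="count")
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    return fig


x = [1, 1, 2, 1, 2, 1, 1, 2]
y = [3.1, 3.1, 3.1, 1, 2, 3.1, 1, 2]
colors = pair_counts(x, y)  # one count per original point, same length as x
# To draw the figure, call plot_pair_counts(x, y) and then matplotlib.pyplot.show().
```

Because the colors array has the same length as x and y, identical points are drawn on top of each other with identical colors. With real non-integer data, exact equality may rarely hold, so rounding y first (for example with np.round) is probably needed before counting; how coarsely to round is a choice that depends on the data.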