python pandas matplotlib scatter-plot color-scheme

Colour code the plot based on the two data frame values


I would like to colour-code the scatter plot using two data frame columns. Each distinct value of df[1] should get its own colour. Within each group of points sharing the same df[1] value, the opacity of that colour should vary with df[2]: the point with the highest df[2] in the group should be 100% opaque, and the point with the lowest df[2] should be the least opaque.

Here is the code:

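import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

def func():
    ...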

df = pd.read_csv(PATH + file, sep=",", header=None)


b = 2.72
a = 0.00000009

popt, pcov = curve_fit(func, df[2], df[5]/df[4], p0=[a,b])

perr = np.sqrt(np.diag(pcov))

# Scatter plot of the data points
plt.scatter(df[1], df[5]/df[4]/df[2], label="data")

# Fitted curve, using the same func that was passed to curve_fit
plt.plot(df[1], func(df[2], *popt)/df[2], "r", label="fit")

plt.legend(loc="upper left")

Here is the sample dataset:

**df[0],df[1],df[2],df[3],df[4],df[5],df[6]**

file_name_1_i1,31,413,36120,10,9,10
file_name_1_i2,31,1240,60488,10,25,27
file_name_1_i3,31,2769,107296,10,47,48
file_name_1_i4,31,8797,307016,10,150,150
file_name_2_i1,34,72,10868,11,9,10
file_name_2_i2,34,6273,250852,11,187,196
file_name_3_i1,36,84,29568,12,9,10
file_name_3_i2,36,969,68892,12,25,26
file_name_3_i3,36,6545,328052,12,150,151
file_name_4_i1,69,116,40712,13,25,26
file_name_4_i2,69,417,80080,13,47,48
file_name_4_i2,69,1313,189656,13,149,150
file_name_4_i4,69,3009,398820,13,195,196
file_name_4_i5,69,22913,2855044,13,3991,4144
file_name_5_i1,85,59,48636,16,47,48
file_name_5_i2,85,163,64888,15,77,77
file_name_5_i3,85,349,108728,16,103,111
file_name_5_i4,85,1063,253180,14,248,248
file_name_5_i5,85,2393,526164,15,687,689
file_name_5_i6,85,17713,3643728,15,5862,5867
file_name_6_i1,104,84,75044,33,137,138
file_name_6_i2,104,455,204792,28,538,598
file_name_6_i3,104,1330,513336,31,2062,2063
file_name_6_i4,104,2925,1072276,28,3233,3236
file_name_6_i5,104,6545,2340416,28,7056,7059
...

So the x-axis is df[1] (31, 31, 31, 31, 34, 34, ...) and the y-axis is df[5]/df[4]/df[2] (for the first row, 9/10/413). Each distinct value of df[1] needs a new colour; it is fine for the colours to repeat in a cycle, say after 6 unique colours. Within each colour, the opacity should change with the value of df[2] (even though the y-axis is df[5]/df[4]/df[2]): the highest value gets the darkest version of the colour and the lowest gets the lightest.
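To make the rule concrete, here is a minimal sketch on the first six sample rows. It assumes integer column labels (as produced by header=None) and takes the opacity as df[2] divided by the largest df[2] in the same df[1] group, which is only one possible reading of the requirement. The names `N_COLORS`, `color_index` and `alpha` are made up for illustration.

```python
import pandas as pd

# First six rows of the sample dataset; only df[1] and df[2] matter here.
df = pd.DataFrame({1: [31, 31, 31, 31, 34, 34],
                   2: [413, 1240, 2769, 8797, 72, 6273]})

N_COLORS = 6  # assumed cycle length ("repeat after 6 unique colours")

# Colour index per row: one per distinct df[1], repeating after N_COLORS.
color_index = df.groupby(1).ngroup() % N_COLORS

# Opacity per row: df[2] relative to the largest df[2] in the same df[1]
# group, so the largest value in each group gets alpha == 1.0.
alpha = df[2] / df.groupby(1)[2].transform("max")

print(pd.DataFrame({"df[1]": df[1],
                    "color_index": color_index,
                    "alpha": alpha.round(3)}))
```

With this reading, the row with df[2] = 8797 is fully opaque within the 31 group, and the row with df[2] = 6273 is fully opaque within the 34 group.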

This is the current scatter plot:

[image: scatter_plot]

This is roughly what the desired colour coding should look like:

[image: desired colour code]

I have around 200 entries in the csv file.

Would using NumPy be more advantageous in this scenario?


Solution

  • Let me know if this is appropriate or if I have misunderstood anything:

    import pandas as pd
    import matplotlib.pyplot as plt
    import numpy as np
    
    # not needed for you: the question already defines df
    # df = pd.read_csv('~/Documents/tmp.csv')
    
    # the question reads its CSV with header=None, so the column labels are
    # integers; cast them to strings to match the lookups below
    df.columns = df.columns.astype(str)
    
    # largest df['2'] within each df['1'] group
    max_2 = df.groupby('1')['2'].max().to_frame()
    
    no_unique_colors = 3
    color_set = [np.random.random(3) for _ in range(no_unique_colors)]
    # assign colors to the unique df['1'] values in cyclic order
    max_2['colors'] = [color_set[i % no_unique_colors] for i in range(max_2.shape[0])]
    
    # build one RGBA color per row: the group's RGB plus alpha = df['2'] / group maximum
    colors = [list(max_2.loc[df1].colors) + [float(df['2'].iloc[i])/max_2['2'].loc[df1]] for i, df1 in enumerate(df['1'])]
    # repeat thrice so that df2, df4 and df5 share the same opacity
    colors = [x for x in colors for _ in range(3)]
    
    plt.scatter(df['1'].values.repeat(3), df[['2', '4', '5']].values.reshape(-1), c=colors)
    plt.show()
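As a possible variation, the colours can also be built without Python-level loops, which touches on the NumPy part of the question. For roughly 200 rows speed is unlikely to matter, so this is mainly about readability. The sketch below assumes integer column labels (header=None) and takes its palette from Matplotlib's "tab10" colormap. The helper name `rgba_by_group` and the `n_colors` and `min_alpha` parameters are illustrative and not part of the code above.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def rgba_by_group(groups, values, n_colors=6, min_alpha=0.2):
    """Return an (n, 4) RGBA array: hue from `groups`, alpha from `values`.

    Hypothetical helper; `n_colors` (at most 10 for "tab10") and
    `min_alpha` are illustrative choices.
    """
    groups = pd.Series(groups).reset_index(drop=True)
    values = pd.Series(values, dtype=float).reset_index(drop=True)

    # First n_colors entries of the "tab10" palette, reused cyclically.
    palette = plt.get_cmap("tab10")(np.arange(n_colors))
    codes, _ = pd.factorize(groups)  # 0, 1, 2, ... in order of first appearance
    rgba = palette[codes % n_colors].copy()

    # Alpha = value / group maximum, clipped so faint points stay visible.
    ratio = values / values.groupby(groups).transform("max")
    rgba[:, 3] = np.clip(ratio.to_numpy(), min_alpha, 1.0)
    return rgba


# Demo on four rows taken from the sample dataset (df[1] and df[2] only).
demo = pd.DataFrame({1: [31, 31, 34, 34], 2: [413, 8797, 72, 6273]})
colors = rgba_by_group(demo[1], demo[2])
```

With the question's data frame, `colors = rgba_by_group(df[1], df[2])` can be passed as `c=colors` to `plt.scatter(df[1], df[5]/df[4]/df[2], c=colors)`, assuming the ratio from the question's scatter call is the intended y-axis. Because of the clipping, several small values in one group may share the floor opacity `min_alpha`.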
    
