Tags: python, matplotlib, scatter-plot, colorbar, scatter

How to make a date-based color bar based on df.idxmax series?


Python beginner/first poster here.

I'm running into trouble adding color bars to scatter plots. I have two types of plot: one that shows all the data color-coded by date, and one that shows only the daily maximum values of my data, also color-coded by date. In the first case, I can use df.index (which is a datetime index) to make my color bar. In the second case, I am using df2['col'].idxmax() to generate the colors, because df2 is a groupby object (which I use to compute the daily maximums in my data) and it does not appear to have an accessible index.

For the first type of plot, I have succeeded in generating a date-based color bar with the code below, cobbled together from online examples:

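import matplotlib.pyplot as plt
from matplotlib.dates import date2num, DateFormatter

fig, ax = plt.subplots(1,1, figsize=(20,20))

# color each point by the date of its row (df.index is a datetime index)
smap=plt.scatter(df.col1, df.col2, s=140,
             c=[date2num(i.date()) for i in df.index],
             marker='.')

# format the color bar ticks as dates
cb = fig.colorbar(smap, orientation='vertical',
              format=DateFormatter('%d %b %y'))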

However for the second type of plot, where I am trying to use df2['col'].idxmax to create the date series instead of df.index, the following does not work:

for n in cols1:
    for m in cols2:
        fig, ax = plt.subplots(1,1, figsize=(15,15))

        maxTimes=df2[n].idxmax()
        # some NaNs in the .idxmax series were giving date2num trouble
        PlottableTimes=maxTimes.dropna()

        smap2=plt.scatter(df2[n].max(), df2[m].max(),
             s=160, c=[date2num(i.date()) for i in PlottableTimes], 
             marker='.')

        cb2 = fig.colorbar(smap2, orientation='vertical',
                      format=DateFormatter('%d %b %y'))  

        plt.show()

The error is: 'length of rgba sequence should be either 3 or 4'

Because the error was complaining of the color argument, I separately checked the output of the color (that is, c=) arguments in the respective plotting commands, and both look similar to me, so I can't figure out why one color argument works and the other doesn't:

one that works:

[736809.0, 736809.0, 736809.0, 736809.0, 736809.0, 736809.0, 736809.0, 736809.0, 736809.0, 736809.0, ...]

one that doesn't work:

[736845.0, 736846.0, 736847.0, 736848.0, 736849.0, 736850.0, 736851.0, 736852.0, 736853.0, 736854.0, ...]

Any suggestions or explanations? I'm running python 3.5.2. Thank you in advance for helping me understand this.

Edit 1: I made the following example for others to explore, and in the process realized the crux of the issue is different than my first question. The code below works the way I want it to:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.dates import date2num, DateFormatter
from pandas import Grouper

df=pd.DataFrame(np.random.randint(low=0, high=10, size=(169, 8)),
            columns=['a', 'b', 'c', 'd', 'e','f','g','h']) #make sample data
date_rng = pd.date_range(start='1/1/2018', end='1/8/2018', freq='H')
df['i']=date_rng
df = df.set_index('i') #get a datetime index
df['ts']=date_rng #get a datetime column to group by

df2=df.groupby(Grouper(key='ts', freq='D')) #group by calendar day

for n in ['a','b','c','d']: #now make some plots
    for m in ['e','f','g','h']:
        print(m)
        print(n)

        fig, ax = plt.subplots(1,1, figsize=(5,5))
        maxTimes=df2[n].idxmax()
        PlottableTimes=maxTimes.dropna()

        smap=plt.scatter(df2[n].max(), df2[m].max(), s=160,
                         c=[date2num(i.date()) for i in PlottableTimes],
                         marker='.')
        cb = fig.colorbar(smap, orientation='vertical',
                          format=DateFormatter('%d %b %y'))
        plt.show()

The only difference between my real data and this example is that my real data has many NaNs scattered throughout. So, I think what is going wrong is that the 'c=' argument ends up shorter than the x and y data, so the plotting command cannot interpret it as one color value per point...? For example, if I manually paste in the output of the c= expression, I get the following code, which also works:

for n in ['a','b','c','d']:
    for m in ['e','f','g','h']:
        print(m)
        print(n)

        fig, ax = plt.subplots(1,1, figsize=(5,5))
        maxTimes=df2[n].idxmax()
        PlottableTimes=maxTimes.dropna()

        smap=plt.scatter(df2[n].max(), df2[m].max(), s=160, 
                     c=[736809.0, 736810.0, 736811.0, 736812.0, 736813.0, 736814.0, 736815.0, 736816.0], 
                     marker='.')
        cb = fig.colorbar(smap, orientation='vertical',
                      format=DateFormatter('%d %b %y'))  
        plt.show()

But, if I shorten the c= array by some amount, to emulate what is happening in my code when NaNs are being dropped from idxmax, it gives the same error I am seeing:

for n in ['a','b','c','d']:
    for m in ['e','f','g','h']:
        print(m)
        print(n)

        fig, ax = plt.subplots(1,1, figsize=(5,5))
        maxTimes=df2[n].idxmax()
        PlottableTimes=maxTimes.dropna()

        smap=plt.scatter(df2[n].max(), df2[m].max(), s=160, 
                     c=[736809.0, 736810.0, 736811.0, 736812.0, 736813.0, 736814.0], 
                     marker='.')
        cb = fig.colorbar(smap, orientation='vertical',
                      format=DateFormatter('%d %b %y'))  
        plt.show()
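
A stripped-down reproduction of that mismatch, as a sketch with made-up numbers (the lists `x`, `y` and `colors` below are illustrative and not taken from the real data). When `c` is a list of numbers whose length differs from `x` and `y`, Matplotlib cannot treat it as one value per point and raises a ValueError; the exact wording of the message depends on the Matplotlib version.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5, 6, 7, 8]             # eight points, like the eight daily maxima
y = [8, 7, 6, 5, 4, 3, 2, 1]
colors = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]  # only six values, as after dropping NaNs

fig, ax = plt.subplots()
try:
    ax.scatter(x, y, c=colors)
    mismatch_error = None
except ValueError as err:
    # The message text differs between Matplotlib versions.
    mismatch_error = err

print(len(x), len(colors), repr(mismatch_error))
plt.close(fig)
```

Padding `colors` to eight values makes the same call succeed, which matches the behaviour described above.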

So this means the real question is: how can I grab the grouper column after grouping from the groupby object, when none of the columns appear to be grab-able with df2.col? I would like to be able to grab 'ts' from the following and use it to be the color data, instead of using idxmax:

df2['a'].max()

ts
2018-01-01    9
2018-01-02    9
2018-01-03    9
2018-01-04    9
2018-01-05    9
2018-01-06    9
2018-01-07    9
2018-01-08    8
Freq: D, Name: a, dtype: int64
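
In that printout, `ts` appears as the index of the aggregated result rather than as a column. A minimal sketch of reading it back, assuming sample data shaped like the example above (the names `hours`, `daily_max` and `group_dates` are made up here for illustration):

```python
import numpy as np
import pandas as pd

# Hourly rows covering eight calendar days, like the example data.
hours = pd.date_range(start="2018-01-01", periods=169, freq=pd.Timedelta(hours=1))
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 10, size=(169, 2)), columns=["a", "e"])
df["ts"] = hours

df2 = df.groupby(pd.Grouper(key="ts", freq="D"))

daily_max = df2["a"].max()     # aggregated result: one row per day
group_dates = daily_max.index  # the 'ts' labels live in the index

print(type(daily_max).__name__)  # Series
print(group_dates)
```

Because the dates come from the same object as the plotted values, they always have one entry per plotted point.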

Solution

  • Essentially, your Grouper call is similar to setting your datetime column as the index and calling pandas.DataFrame.resample with an aggregate function:

    df.set_index('ts').resample('D').max()
    #             a  b  c  d  e  f  g  h
    # ts                                
    # 2018-01-01  9  9  8  9  9  9  9  9
    # 2018-01-02  9  9  9  9  9  9  9  9
    # 2018-01-03  9  9  9  9  9  9  9  9
    # 2018-01-04  9  9  9  9  9  9  9  9
    # 2018-01-05  9  9  9  9  9  9  9  9
    # 2018-01-06  9  9  9  8  9  9  9  9
    # 2018-01-07  9  9  9  9  9  9  9  9
    # 2018-01-08  2  8  6  3  1  3  2  7
    

    Therefore, df2['a'].max() returns a pandas Series indexed by the daily group labels, and it carries the index property, which you can use for the color bar specification:

    df2['a'].max().index
    
    # DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
    #                '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'],
    #               dtype='datetime64[ns]', name='ts', freq='D')
    

    From there you can pass the index to date2num directly, without a list comprehension:

    date2num(df2['a'].max().index)
    
    # array([736695., 736696., 736697., 736698., 736699., 736700., 736701., 736702.])
    

    Altogether, simply use the above inside the loop, without needing maxTimes or PlottableTimes:

    fig, ax = plt.subplots(1, 1, figsize = (5,5))
    
    smap = plt.scatter(df2[n].max(), df2[m].max(), s = 160, 
                       c = date2num(df2[n].max().index), 
                       marker = '.')
    cb = fig.colorbar(smap, orientation = 'vertical',
                      format = DateFormatter('%d %b %y'))
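
For data with gaps, one way to keep the colors aligned with the points is to aggregate first and then drop incomplete days from x, y and the index together. The following is a hedged sketch of that idea rather than part of the code above: the sample frame, the injected NaN day, and the names `daily_max`, `pair` and `point_counts` are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to display figures
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.dates import DateFormatter, date2num

# Hourly sample data over eight days, with one full day of 'a' set to NaN
# to mimic gaps in real data (an assumption for illustration).
hours = pd.date_range(start="2018-01-01", periods=169, freq=pd.Timedelta(hours=1))
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 10, size=(169, 4)).astype(float),
                  columns=["a", "b", "e", "f"])
df["ts"] = hours
df.loc[df["ts"].dt.day == 3, "a"] = np.nan

# Daily maxima for every column; the group dates become the index.
daily_max = df.groupby(pd.Grouper(key="ts", freq="D")).max()

point_counts = {}
for n in ["a", "b"]:
    for m in ["e", "f"]:
        # Drop days where either maximum is missing, so x, y and c
        # come from the same rows and have the same length.
        pair = daily_max[[n, m]].dropna()
        point_counts[(n, m)] = len(pair)

        fig, ax = plt.subplots(1, 1, figsize=(5, 5))
        # date2num returns days since Matplotlib's date epoch; the absolute
        # numbers depend on the Matplotlib version, the day spacing does not.
        smap = ax.scatter(pair[n], pair[m], s=160,
                          c=date2num(pair.index), marker=".")
        fig.colorbar(smap, ax=ax, orientation="vertical",
                     format=DateFormatter("%d %b %y"))
        plt.close(fig)  # replace with plt.show() when working interactively

print(point_counts)
```

Pairs involving column 'a' plot seven points here and the others plot eight, and in both cases the color array has exactly as many entries as there are points.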