Search code examples
pythonpandasfilteringanalysis

Filtering Pandas Dataframe by mean of last N values


I'm trying to get all records where the mean of the last 3 rows is greater than the overall mean for all rows in a filtered set.

_filtered_d_all = _filtered_d.iloc[:, 0:50].loc[:, _filtered_d.mean()>0.05]
_last_n_records = _filtered_d.tail(3)

Something like this

_filtered_growing = _filtered_d.iloc[:, 0:50].loc[:, _last_n_records.mean() > _filtered_d.mean()]

However, the problem here is that the value length is incorrect. Any tips?

ValueError: Series lengths must match to compare

Sample Data

This has an index on the year and month, and 2 columns.

            Col1    Col2
year    month       
2005    12  0.533835    0.170679
        12  0.494733    0.198347
2006    3   0.440098    0.202240
        6   0.410285    0.188421
        9   0.502420    0.200188
        12  0.522253    0.118680
2007    3   0.378120    0.171192
        6   0.431989    0.145158
        9   0.612036    0.178097
        12  0.519766    0.252196
2008    3   0.547705    0.202163
        6   0.560985    0.238591
        9   0.617320    0.199537
        12  0.343939    0.253855

Solution

  • Why not just boolean index directly on your filtered DataFrame with

    df[df.tail(3).mean() > df.mean()]
    

    Demo

    >>> df
       0  1  2  3  4
    0  4  8  2  4  6
    1  0  0  0  2  8
    2  5  3  0  9  3
    3  7  5  5  1  2
    4  9  7  8  9  4
    
    >>> df[df.tail(3).mean() > df.mean()]
       0  1  2  3  4
    0  4  8  2  4  6
    1  0  0  0  2  8
    2  5  3  0  9  3
    3  7  5  5  1  2
    

    Update example for MultiIndex edit

    The same should work fine for your MultiIndex sample, we just have to mask a bit differently of course.

    >>> df 
                 col1      col2
    2005 12 -0.340088 -0.574140
         12 -0.814014  0.430580
    2006 3   0.464008  0.438494
         6   0.019508 -0.635128
         9   0.622645 -0.824526
         12 -1.674920 -1.027275
    2007 3   0.397133  0.659467
         6   0.026170 -0.052063
         9   0.835561  0.608067
         12  0.736873 -0.613877
    2008 3   0.344781 -0.566392
         6  -0.653290 -0.264992
         9   0.080592 -0.548189
         12  0.585642  1.149779
    
    >>> df.loc[:,df.tail(3).mean() > df.mean()] 
                 col2
    2005 12 -0.574140
         12  0.430580
    2006 3   0.438494
         6  -0.635128
         9  -0.824526
         12 -1.027275
    2007 3   0.659467
         6  -0.052063
         9   0.608067
         12 -0.613877
    2008 3  -0.566392
         6  -0.264992
         9  -0.548189
         12  1.149779