Tags: python, pandas, space-efficiency

How to reduce a pandas Series by performing an operation on every set of N sequential elements


Say I have a pandas Series, and I want to take the mean of every set of 8 rows. I don't know the size of the Series in advance, and the index may not be 0-based. I currently have the following:

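import numpy as np
import pandas as pd

N = 8
s = pd.Series(np.random.random(50 * N))

# Number of complete blocks of N rows
n_sets = s.shape[0] // N

# Start and stop positions of each block
split = ([m * N for m in range(n_sets)],
         [m * N for m in range(1, n_sets + 1)])

out_array = np.zeros(n_sets)
for i, (a, b) in enumerate(zip(*split)):
    # Look up labels by position so the index need not be 0-based
    out_array[i] = s.loc[s.index[a:b]].mean()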

Is there a shorter way to do this?


Solution

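  • You could try groupby: floor-dividing the index by N (s.index // N) gives each run of N consecutive rows the same group key, and pd.Series.mean() then averages each group. This assumes the default 0-based integer index, as in your example; for any other index, np.arange(len(s)) // N can be used as the group key instead.

    newout_array = s.groupby(s.index // N).mean().to_list()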
    

    Output:

    out_array  #original solution
    [0.42147899 0.55668055 0.5222594  0.46066426 0.44378491 0.52719371
     0.42479113 0.46485387 0.2800083  0.57174865 0.59207811 0.58665479
     0.52414851 0.38158931 0.51884761 0.59007469 0.3449512  0.56385373
     0.34359674 0.44524997 0.44175351 0.42339394 0.5687501  0.3140091
     0.40985639 0.46649486 0.3101396  0.45664647 0.51829052 0.38875796
     0.45428001 0.52979064 0.62545921 0.64782618 0.65265239 0.56976799
     0.64277369 0.33528876 0.45973874 0.45341751 0.52690983 0.66427599
     0.59814577 0.35575622 0.62995929 0.61582329 0.38971679 0.4771326
     0.50889137 0.25105353]
    
    
    newout_array  #new solution
    
    [0.4214789945860148, 0.5566805507021909, 0.5222593998859411, 0.46066425607167216, 0.4437849132421554, 0.5271937114894408,
     0.424791134573943, 0.4648538659945887, 0.28000829556024387, 0.5717486453029332, 0.5920781058695997, 0.5866547941460012, 
     0.5241485100329547, 0.38158931177460725, 0.5188476113762392, 0.5900746905953183, 0.34495119855714756, 0.5638537286251522, 
     0.3435967359945349, 0.44524997190104454, 0.44175351484451975, 0.42339393886425913, 0.5687501027416468, 0.3140090963728155, 
     0.40985639015924036, 0.4664948621046134, 0.3101396034068746, 0.45664647332866076, 0.5182905157666298, 0.38875796468438406, 
     0.4542800111275337, 0.5297906368971982, 0.6254592119278896, 0.6478261817988752, 0.6526523935382951, 0.569767994485338, 
     0.642773691835847, 0.3352887578683835, 0.45973873832126594, 0.45341751320112617, 0.5269098312525405, 0.6642759923683706, 
     0.5981457683986061, 0.3557562229383897, 0.6299592930489117, 0.6158232897272005, 0.38971678834383916, 0.4771325988592886, 
     0.5088913710936904, 0.25105352820427246]
    

    The difference is only in how the values are displayed: NumPy prints arrays with 8 digits of precision by default, while a Python list shows each float at full precision. If you want only 8 decimals, as in the original out_array, you could round the elements:

    newout_array = s.groupby(s.index // N).mean().to_list()
    newout_array = [round(x, 8) for x in newout_array]
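
Since the question mentions that the index may not be 0-based and the length is not known in advance, here is a minimal sketch of the same groupby idea keyed on row position instead of index labels. The helper name blockwise_mean and its drop_partial flag are hypothetical, introduced only for illustration. With drop_partial=True it ignores a trailing block shorter than n, which is what the loop in the question does.

```python
import numpy as np
import pandas as pd


def blockwise_mean(s: pd.Series, n: int, drop_partial: bool = True) -> np.ndarray:
    """Mean of every n consecutive rows of s, grouped by position, not by label."""
    if drop_partial:
        # Mirror the question's loop: ignore a trailing block with fewer than n rows.
        s = s.iloc[: (len(s) // n) * n]
    # Position-based group ids 0,0,...,1,1,... do not depend on the index labels.
    group_ids = np.arange(len(s)) // n
    return s.groupby(group_ids).mean().to_numpy()


N = 8
rng = np.random.default_rng(0)
# 50 full blocks plus 3 extra rows, with an index that starts at 100.
s = pd.Series(rng.random(50 * N + 3), index=np.arange(100, 100 + 50 * N + 3))

result = blockwise_mean(s, N)

# Reference: an explicit positional loop equivalent to the one in the question.
n_sets = len(s) // N
expected = np.array([s.iloc[m * N:(m + 1) * N].mean() for m in range(n_sets)])

print(result.shape)
print(np.allclose(result, expected))
```

Comparing with np.allclose rather than == avoids depending on the last bits of the floating-point results. Passing drop_partial=False would keep the trailing short block as one extra mean.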