Pandas DataFrame: operation per batch of rows


I have a pandas DataFrame df, and I want to compute some statistics for each batch of rows.

For example, say batch_size = 200000.

For each batch of batch_size rows, I would like the number of unique values in the ID column of my DataFrame.

How can I do this?

Here is an example of what I want:

print(df)

>>
+-------+
|     ID|
+-------+
|      1|
|      1|
|      2|
|      2|
|      2|
|      3|
|      3|
|      3|
|      3|
+-------+

batch_size = 3

my_new_function(df,batch_size)

>>
For batch 1 (0 to 2):
2 unique values
1 appears 2 times
2 appears 1 time

For batch 2 (3 to 5):
2 unique values
2 appears 2 times
3 appears 1 time

For batch 3 (6 to 8):
1 unique value
3 appears 3 times

Note: the output can of course be a simple DataFrame.


Solution

  • See here for splitting the DataFrame. After that, for each resulting batch batch_df, I would do:

    from collections import Counter
    Counter(batch_df['ID'].tolist())
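
The splitting step is not shown above, so here is a minimal end-to-end sketch of one way to combine it with the Counter idea. It assumes the batches are consecutive row ranges taken with iloc, which may differ from the linked splitting method. The function name batch_counts and the sample DataFrame are made up for illustration. The last part shows a groupby-based alternative that may be faster on large frames, since it avoids the Python-level loop.

```python
from collections import Counter

import numpy as np
import pandas as pd


def batch_counts(df, batch_size, column="ID"):
    """Return one Counter per consecutive batch of `batch_size` rows.

    Hypothetical helper: batches are taken positionally with iloc,
    and the last batch may be shorter than `batch_size`.
    """
    results = []
    for start in range(0, len(df), batch_size):
        batch_df = df.iloc[start:start + batch_size]
        results.append(Counter(batch_df[column].tolist()))
    return results


# Sample data matching the example in the question.
df = pd.DataFrame({"ID": [1, 1, 2, 2, 2, 3, 3, 3, 3]})
batch_size = 3

for number, counts in enumerate(batch_counts(df, batch_size), start=1):
    print(f"For batch {number}: {len(counts)} unique value(s)")
    for value, times in counts.items():
        print(f"  {value} appears {times} time(s)")

# Alternative without an explicit loop: label each row with its batch
# number (row position // batch_size) and group on that label.
batch_id = np.arange(len(df)) // batch_size
unique_per_batch = df.groupby(batch_id)["ID"].nunique()
print(unique_per_batch)
```

If a DataFrame is preferred as output, calling value_counts() on the same grouped column should give the per-value counts per batch in a single Series that can be reshaped as needed.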