Pandas - splitting a dataframe into equal-sized chunks of rows and numbering each chunk in a new case_id column, starting from 1


I have a large dataframe with more than 100000 rows. I want to divide the rows into chunks of equal size, say 4 rows each, and create a new column case_id that assigns 1 to the first group of 4 rows, 2 to the next group, and so on.

id    col1   col2
0     A      A
1     B      B
2     C      C
3     D      D
4     E      E
5     F      F
6     G      G
7     H      H

The id column starts from 0.

Desired output:

id    col1   col2    new_col_case_id
0     A      A       1
1     B      B       1
2     C      C       1
3     D      D       1
4     E      E       2
5     F      F       2
6     G      G       2
7     H      H       2

Any help is appreciated.


Solution

  • You can try dividing the id column by the chunk size, keeping the integer part, and adding 1:

    import pandas as pd

    n = 4  # number of rows in each chunk
    data = {"id": [0, 1, 2, 3, 4, 5, 6, 7],
            "col1": ["a", "b", "c", "d", "e", "f", "g", "h"],
            "col2": ["a", "b", "c", "d", "e", "f", "g", "h"]
            }
    df = pd.DataFrame.from_dict(data)
    # id is consecutive and starts at 0, so floor division by n gives the
    # 0-based chunk number; add 1 to start counting from 1
    df["new_col_case_id"] = df["id"] // n + 1
    df = df.set_index("id")  # optional
    print(df)
    

    Output:

       col1 col2  new_col_case_id
    id                           
    0     a    a                1
    1     b    b                1
    2     c    c                1
    3     d    d                1
    4     e    e                2
    5     f    f                2
    6     g    g                2
    7     h    h                2
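
    Note that the approach above depends on the id values being consecutive and starting at 0. If that is not guaranteed (for example after filtering or sorting), one option is to number the chunks by row position instead. A minimal sketch, assuming the same chunk size n and column name as above; the helper name add_case_id and the sample ids are made up for illustration:

```python
import numpy as np
import pandas as pd


def add_case_id(df, n, col="new_col_case_id"):
    """Return a copy of df with a 1-based chunk number for every n rows, by position.

    Hypothetical helper, not part of pandas.
    """
    out = df.copy()
    # np.arange(len(out)) is 0, 1, 2, ... regardless of the index or id values,
    # so floor division by n yields the 0-based chunk number of each row
    out[col] = np.arange(len(out)) // n + 1
    return out


# Illustrative data: ids that are neither consecutive nor zero-based
df = pd.DataFrame({"id": [10, 11, 15, 20, 21, 30, 31, 40, 41, 50],
                   "col1": list("abcdefghij")})
result = add_case_id(df, n=4)
print(result)
```

    When the number of rows is not a multiple of n, the last chunk is simply shorter: here the 10 rows get case ids 1, 1, 1, 1, 2, 2, 2, 2, 3, 3.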