
pandas: divide a dataframe based on a condition in certain columns and rows


I have a dataframe (much larger than this example) as follows, where the values in the first two columns are repeated 5 times.

import pandas as pd
df = pd.DataFrame({'text':['the weather is nice','the weather is nice','the weather is nice','the weather is nice','the weather is nice',
                        'the house is beautiful','the house is beautiful','the house is beautiful','the house is beautiful','the house is beautiful',
                        'the day is long','the day is long','the day is long','the day is long','the day is long'],
               'reference':['weather','weather','weather','weather','weather',
                            'house','house','house','house','house',
                            'day','day','day','day','day'],
               'id':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]})

I would like to split this pandas dataframe into two dataframes so that, within each block of five repeated rows, the first two consecutive rows go into the first dataframe and the remaining three go into the second, as follows.

The desired output:

first df:

                      text reference  id
0      the weather is nice   weather   1
1      the weather is nice   weather   2
3   the house is beautiful     house   6
4   the house is beautiful     house   7
5         the day is long       day  11
6         the day is long       day  12

second df:
                      text reference  id
0      the weather is nice   weather   3
1      the weather is nice   weather   4
2      the weather is nice   weather   5
3   the house is beautiful     house   8
4   the house is beautiful     house   9
5   the house is beautiful     house  10
6         the day is long       day  13
7         the day is long       day  14
8         the day is long       day  15

Obviously, selecting every n-th row does not work (e.g. df.iloc[::3, :] or df[df.index % 3 == 0]), so I would like to know how to produce the output shown above.


Solution

  • If you want to group on the value of reference (first 2 items vs rest):

    mask = df.groupby('reference').cumcount().gt(1)
    groups = [g for k,g in df.groupby(mask)]
    
    # or manually
    # df1 = df[~mask]
    # df2 = df[mask]
    

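  • Using position (this assumes every block has exactly 5 rows):

    import numpy as np

    # True for positions 2, 3, 4 within each block of 5
    mask = (np.arange(len(df)) % 5) > 1
    
    # or with a range index
    # mask = (df.index % 5) > 1
    
    # then same as above using groupby or slicing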
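
To check both approaches end to end, here is a minimal self-contained sketch. It rebuilds the example frame with np.repeat (a shortcut that is not in the question) and assumes, as in the question, that every reference value covers exactly 5 consecutive rows. The names df1, df2 and pos_mask are only illustrative.

```python
import numpy as np
import pandas as pd

# Stand-in for the question's frame: 3 reference values, 5 rows each.
df = pd.DataFrame({
    'text': np.repeat(['the weather is nice',
                       'the house is beautiful',
                       'the day is long'], 5),
    'reference': np.repeat(['weather', 'house', 'day'], 5),
    'id': range(1, 16),
})

# Label-based: number the rows within each 'reference' group (0..4).
# Rows numbered 0 and 1 go to the first frame, the rest to the second.
mask = df.groupby('reference').cumcount().gt(1)
df1 = df[~mask].reset_index(drop=True)
df2 = df[mask].reset_index(drop=True)

# Position-based alternative: only valid when every block has exactly 5 rows.
pos_mask = (np.arange(len(df)) % 5) > 1

print(df1['id'].tolist())  # [1, 2, 6, 7, 11, 12]
print(df2['id'].tolist())  # [3, 4, 5, 8, 9, 10, 13, 14, 15]
```

The cumcount version keeps working if the groups have different sizes, whereas the positional mask relies on the fixed block length of 5. Drop the reset_index calls if the original row labels should be kept.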