Search code examples
pythonpandasdataframe

Select rows from DataFrame where ID count is greater than X


I have a Dataframe which a column that contains an ID. This ID represents a person and can be present many times:

        col_id    col2     col3     col4....
row1      1
row2      1
row3      2
row4      3
row5      3
row6      3
row7      1
row8      7

I need to return a new dataframe where the value_counts of the ID column is greater than, say 2.

New dataframe:

         col_id    col2     col3     col4....
row1      1
row2      1
row3      3
row4      3
row5      3
row6      1

This new dataframe contains rows where the ID count is greater than 2 only.

Edit

From here I need to separate the data by ID. Ideally I would like a solution where I have a dataframe for each ID:

Dataframe 1

    col_id   col2    col3    col4....
r1     1
r2     1
r3     1

Dataframe 2

    col_id   col2    col3    col4....
r1     2
r2     2
r3     2

Dataframe 3

    col_id   col2    col3    col4....
r1     3
r2     3
r3     3

Is it possible to join these into one large dataframe? So I can have a new column, called 'index' that holds the rows for ID==1, ID==2, etc:

index
  1        col_id   col2    col3    col4....
       r1     1
       r2     1
       r3     1



index
  2       col_id   col2    col3    col4....
      r1     2
      r2     2
      r3     2


    index
       3      col_id   col2    col3    col4....
         r1     3
         r2     3
         r3     3

Solution

  • Use GroupBy.transform with GroupBy.size for Series with same size like original DataFrame, so possible filter by boolean indexing:

    df = df[df.groupby('col_id').transform('size') > 2]
    print (df)
          col_id
    row1       1
    row2       1
    row4       3
    row5       3
    row6       3
    row7       1
    

    If performance is not important or small DataFrame is possible use DataFrameGroupBy.filter:

    df = df.groupby('col_id').filter(lambda x: len(x) > 2)
    

    EDIT: For separate DataFrames by col_id is possible create dictionary of DataFrames:

    dfs = dict(tuple(df.groupby('col_id')))
    print (dfs[1])
          col_id
    row1       1
    row2       1
    row7       1
    
    print (dfs[2])
          col_id
    row3       2
    
    print (dfs[3])
          col_id
    row4       3
    row5       3
    row6       3
    

    Possible, but non pythonic solution with globals, not recommended use (only for fun):

    for i, df in df.groupby('col_id'):
        globals()['df{}'.format(i)] =  df
    
    print (df1)
          col_id
    row1       1
    row2       1
    row7       1