I have a Dataframe which a column that contains an ID. This ID represents a person and can be present many times:
col_id col2 col3 col4....
row1 1
row2 1
row3 2
row4 3
row5 3
row6 3
row7 1
row8 7
I need to return a new dataframe where the value_counts of the ID column is greater than, say 2.
New dataframe:
col_id col2 col3 col4....
row1 1
row2 1
row3 3
row4 3
row5 3
row6 1
This new dataframe contains rows where the ID count is greater than 2 only.
Edit
From here I need to separate the data by ID. Ideally I would like a solution where I have a dataframe for each ID:
Dataframe 1
col_id col2 col3 col4....
r1 1
r2 1
r3 1
Dataframe 2
col_id col2 col3 col4....
r1 2
r2 2
r3 2
Dataframe 3
col_id col2 col3 col4....
r1 3
r2 3
r3 3
Is it possible to join these into one large dataframe? So I can have a new column, called 'index' that holds the rows for ID==1, ID==2, etc:
index
1 col_id col2 col3 col4....
r1 1
r2 1
r3 1
index
2 col_id col2 col3 col4....
r1 2
r2 2
r3 2
index
3 col_id col2 col3 col4....
r1 3
r2 3
r3 3
Use GroupBy.transform
with GroupBy.size
for Series
with same size like original DataFrame, so possible filter by boolean indexing
:
df = df[df.groupby('col_id').transform('size') > 2]
print (df)
col_id
row1 1
row2 1
row4 3
row5 3
row6 3
row7 1
If performance is not important or small DataFrame
is possible use DataFrameGroupBy.filter
:
df = df.groupby('col_id').filter(lambda x: len(x) > 2)
EDIT: For separate DataFrames by col_id
is possible create dictionary of DataFrames:
dfs = dict(tuple(df.groupby('col_id')))
print (dfs[1])
col_id
row1 1
row2 1
row7 1
print (dfs[2])
col_id
row3 2
print (dfs[3])
col_id
row4 3
row5 3
row6 3
Possible, but non pythonic solution with globals, not recommended use (only for fun):
for i, df in df.groupby('col_id'):
globals()['df{}'.format(i)] = df
print (df1)
col_id
row1 1
row2 1
row7 1