I have a dataset with mulitple rows. I want to create a new dataset based on number of duplicate rows for a column. For first dataset, i want a dataset of no duplicate rows, meaning only the row with one value. For second dataset, i want two duplicate rows and three duplicate rows,but only up to the second one. For third dataset, i want a dataset with just three duplicate rows. So as an example, i wrote codes to describe this situation. Let's Say i have a dataframe as such
x = {'column1': ['a','a','b','b','b','c','c','c','d'],
'column2': [22000,25000,27000,350,0,3,5,4,312]
}
df = pd.DataFrame(x, columns = ['column1', 'column2'])
print (df)
The first dataset should look like this:
x = {'column1': ['d'],
'column2': [312]
}
df = pd.DataFrame(x, columns = ['column1', 'column2'])
print (df)
Second dataset should look like this:
x = {'column1': ['a','a','b','b','c','c'],
'column2': [22000,25000,27000,350,3,5]
}
df = pd.DataFrame(x, columns = ['column1', 'column2'])
print (df)
Third dataset should look like this:
x = {'column1': ['b','b','b','c','c','c'],
'column2': [27000,350,0,3,5,4]
}
df = pd.DataFrame(x, columns = ['column1', 'column2'])
print (df)
How would i do this not manually?
First create counter Series
by counts:
x = {'column1': ['a','a','b','b','b','c','c','c','d'],
'column2': [22000,25000,27000,350,0,3,5,4,312]
}
df = pd.DataFrame(x, columns = ['column1', 'column2'])
print (df)
s = df.groupby('column1')['column1'].transform('size')
#alternative
#s = df['column1'].map(df['column1'].value_counts())
print (s)
0 2
1 2
2 3
3 3
4 3
5 3
6 3
7 3
8 1
Name: column1, dtype: int64
And then on list comprehension create DataFrames with filter by GroupBy.head
:
L = [df[s >= i].groupby(['column1',s]).head(i) if i > 1 else g for i, g in df.groupby(s)]
print (L[0])
print (L[1])
print (L[2])
column1 column2
8 d 312
column1 column2
0 a 22000
1 a 25000
2 b 27000
3 b 350
5 c 3
6 c 5
column1 column2
2 b 27000
3 b 350
4 b 0
5 c 3
6 c 5
7 c 4
EDIT1: Because solution is more complicated custom function is created and for rows with 3
values is used GroupBy.nth
:
def func(x, g):
if x == 1:
return g
else:
df1 = df[s >= x].groupby(['column1',s]).head(x)
if x == 3:
return (df1.groupby(['column1',s], group_keys=False)
.nth([0, -1])
.reset_index(level=1, drop=True)
.reset_index())
else:
return df1
L = [func(i, g) for i, g in df.groupby(s)]
print (L[0])
print (L[1])
print (L[2])
column1 column2
8 d 312
column1 column2
0 a 22000
1 a 25000
2 b 27000
3 b 350
5 c 3
6 c 5
column1 column2
0 b 27000
1 b 0
2 c 3
3 c 4