Tags: python, pandas, numpy, data-manipulation, data-cleaning

How to extract data based on number of duplicate rows?


I have a dataset with multiple rows. I want to create new datasets based on the number of duplicate rows per value in a column. For the first dataset, I want only the rows whose value appears once (no duplicates). For the second dataset, I want the values that appear two or three times, but only up to the second row of each. For the third dataset, I want just the values that appear three times. As an example, I wrote code to describe this situation. Let's say I have a DataFrame like this:

import pandas as pd

x = {'column1': ['a','a','b','b','b','c','c','c','d'],
    'column2': [22000,25000,27000,350,0,3,5,4,312]
    }
df = pd.DataFrame(x, columns = ['column1', 'column2'])
print (df)

The first dataset should look like this:

x = {'column1': ['d'],
    'column2': [312]
    }
df = pd.DataFrame(x, columns = ['column1', 'column2'])
print (df)

The second dataset should look like this:

x = {'column1': ['a','a','b','b','c','c'],
    'column2': [22000,25000,27000,350,3,5]
    }
df = pd.DataFrame(x, columns = ['column1', 'column2'])
print (df)

The third dataset should look like this:

x = {'column1': ['b','b','b','c','c','c'],
    'column2': [27000,350,0,3,5,4]
    }
df = pd.DataFrame(x, columns = ['column1', 'column2'])
print (df)

How would I do this programmatically rather than manually?


Solution

  • First, create a counter Series with the number of occurrences of each column1 value:

    x = {'column1': ['a','a','b','b','b','c','c','c','d'],
        'column2': [22000,25000,27000,350,0,3,5,4,312]
        }
    df = pd.DataFrame(x, columns = ['column1', 'column2'])
    print (df)
    
    s = df.groupby('column1')['column1'].transform('size')
    #alternative
    #s = df['column1'].map(df['column1'].value_counts())
    print (s)
    0    2
    1    2
    2    3
    3    3
    4    3
    5    3
    6    3
    7    3
    8    1
    Name: column1, dtype: int64
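
    As a quick sanity check (not part of the original answer), the value_counts alternative from the comment produces the same counter Series as transform('size'); this continues from the df created above:

    # df as created above
    s1 = df.groupby('column1')['column1'].transform('size')
    s2 = df['column1'].map(df['column1'].value_counts())
    
    # both broadcast the per-value count back onto every row of df
    print (s1.equals(s2))
    True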
    

    Then build the DataFrames in a list comprehension, filtering each group with GroupBy.head:

    # for each duplicate count i, keep the first i rows per value; the count-1 group needs no filtering
    L = [df[s >= i].groupby(['column1',s]).head(i) if i > 1 else g for i, g in df.groupby(s)]
    print (L[0])
    print (L[1])
    print (L[2])
      column1  column2
    8       d      312
      column1  column2
    0       a    22000
    1       a    25000
    2       b    27000
    3       b      350
    5       c        3
    6       c        5
      column1  column2
    2       b    27000
    3       b      350
    4       b        0
    5       c        3
    6       c        5
    7       c        4
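
    If it is more convenient to look the subsets up by their duplicate count instead of by list position, the same comprehension can be collected into a dictionary; this is a small variation, not part of the original answer, and the name D is only illustrative:

    D = {i: (df[s >= i].groupby(['column1',s]).head(i) if i > 1 else g)
         for i, g in df.groupby(s)}
    print (D[1])
    print (D[3])
      column1  column2
    8       d      312
      column1  column2
    2       b    27000
    3       b      350
    4       b        0
    5       c        3
    6       c        5
    7       c        4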
    

    EDIT1: Because the requirement is more complicated, a custom function is created, and for values with three duplicates GroupBy.nth is used to keep only the first and last row of each group:

    def func(x, g):
        # x is the duplicate count, g contains the rows whose value occurs x times
        if x == 1:
            return g
        else:
            # keep at most x rows per value, among values occurring at least x times
            df1 = df[s >= x].groupby(['column1',s]).head(x)
            if x == 3:
                # for triples, keep only the first and last row of each group
                return (df1.groupby(['column1',s], group_keys=False)
                           .nth([0, -1])
                           .reset_index(level=1, drop=True)
                           .reset_index())
            else:
                return df1
    
    L = [func(i, g) for i, g in df.groupby(s)]
    print (L[0])
    print (L[1])
    print (L[2])
      column1  column2
    8       d      312
      column1  column2
    0       a    22000
    1       a    25000
    2       b    27000
    3       b      350
    5       c        3
    6       c        5
      column1  column2
    0       b    27000
    1       b        0
    2       c        3
    3       c        4
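
    Note that the behaviour of GroupBy.nth changed between pandas versions (newer releases keep the original row index instead of the group keys), so the reset_index chain above may not work everywhere. A version-independent sketch for the same "first and last row of each group" selection, using GroupBy.cumcount instead of GroupBy.nth (my alternative, not the original answer):

    # rows belonging to values that occur at least three times, at most three per value
    df1 = df[s >= 3].groupby(['column1', s]).head(3)
    
    # position of each row inside its column1 group and the size of that group
    pos = df1.groupby('column1').cumcount()
    size = df1.groupby('column1')['column1'].transform('size')
    
    # keep only the first and last row of each group
    out = df1[(pos == 0) | (pos == size - 1)].reset_index(drop=True)
    print (out)
      column1  column2
    0       b    27000
    1       b        0
    2       c        3
    3       c        4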