Search code examples
pythonpandasconditional-statementsdrop

python pandas drop duplicate columns by condition


i want to drop duplicate columns by condition so what i want to do is where "type" is the same(duplicate) drop the "number" one

i got this

data={"col1":[2,3,4,5,9,2,6],
"col2":[4,2,4,6,0,1,5],
"col3":[7,6,0,11,3,6,7],
"col4":[14,11,22,8,6,3,9],
"col5":[0,5,7,3,8,2,9],
"type":["A","A","C","D","B","B","E"],
"number":["one","two","two","one","one","two","two"]}
df=pd.DataFrame.from_dict(data)

and i want to have this

data={"col1":[3,4,5,2,6],
"col2":[2,4,6,1,5],
"col3":[6,0,11,6,7],
"col4":[11,22,8,3,9],
"col5":[5,7,3,2,9],
"type":["A","C","D","B","E"],
"number":["two","two","one","two","two"]}
df=pd.DataFrame.from_dict(data)

Solution

  • You can chain 2 conditions - select all non one values by compare for Series.ne and inverted mask with Series.duplicated:

    df1 = df[df['number'].ne('one') | ~df['type'].duplicated(keep=False)]
    print (df1)
       col1  col2  col3  col4  col5 type number
    1     3     2     6    11     5    A    two
    2     4     4     0    22     7    C    two
    3     5     6    11     8     3    D    one
    5     2     1     6     3     2    B    two
    6     6     5     7     9     9    E    two
    

    Another idea with ordered categoricals:

    cats = pd.unique(['one'] + df['number'].unique().tolist())
    
    df['number'] = pd.Categorical(df['number'], categories=cats, ordered=True)
    
    df2 = df.sort_values('number').drop_duplicates(subset=['type'], keep='last').sort_index()
    print (df2)
       col1  col2  col3  col4  col5 type number
    1     3     2     6    11     5    A    two
    2     4     4     0    22     7    C    two
    3     5     6    11     8     3    D    one
    5     2     1     6     3     2    B    two
    6     6     5     7     9     9    E    two