Search code examples
pythonpandasduplicatesdrop

remove columns with duplicated column names only if row values are a certain string


I need to remove columns that have duplicated names, but only remove the duplicate that has all row values as 'nan' (as a string, not NaN). The example data and expected output are below. Thanks so much!

df = pd.DataFrame({'id':[1,2,3,4],'a':[0,0,0,'nan'], 'b':['nan','nan','nan','nan'], 'c':['nan','nan','nan','nan'], 'd':[1,'nan',0,2]})

df = df.rename(columns = {'a':'a','b':'a', 'c':'b', 'd':'b'})

    id  a   a   b   b
0   1   0   nan nan 1
1   2   0   nan nan nan
2   3   0   nan nan 0
3   4   nan nan nan 2

expected output

    id  a   b
0   1   0   1
1   2   0   nan
2   3   0   0
3   4   nan 2

Solution

  • The question is, if you want to keep columns which are empty, but unique. If not you can solve it in a single line: df1 = df.dropna(how='all', axis='columns') If you want to keep them you can drop them and then re-create them because they are essentially just empty after all.

    Input:

    df = pd.DataFrame({'id':[1,2,3,4],'a':[0,0,0,np.nan], 'b':[np.nan,np.nan,np.nan,np.nan], 'c':[np.nan,np.nan,np.nan,np.nan], 'd':[1,np.nan,0,2], 'e':[np.nan,np.nan,np.nan,np.nan]})
    df = df.rename(columns = {'a':'a','b':'a', 'c':'b', 'd':'b'})
    

    Code:

    df1 = df.dropna(how='all', axis='columns').copy()
    df1[list(df.columns.difference(df1.columns))] = np.nan
    

    Out:

    #    id    a    b   e
    # 0   1  0.0  1.0 NaN
    # 1   2  0.0  NaN NaN
    # 2   3  0.0  0.0 NaN
    # 3   4  NaN  2.0 NaN