Search code examples
pythonpandassklearn-pandas

How to drop 2 or more columns based on column name, type and statistics such as std in Python?


I am trying to drop all the columns containing k1 and p5 and have a new DataFrame without those columns but I get an error.

df_r1 =  df.filter(regex=("k1\s")) #filter all k1 columns
df_r2 =  df.filter(regex=("p5\s")) #filter all p5 columns

df_dropped_new = df.drop((df_r1,df_r2),axis = 1)

and the above line gets me the following error

DataFrame' objects are mutable, thus they cannot be hashed

Furthermore, I am performing 2 more filters to my dataFrame. The first one is dropping columns containing "objects" and the second one is dropping columns that have std=0.


Solution

  • You can use:

    df = pd.DataFrame({'A':list('abcdef'),
                       'p5 sd':[4,5,4,5,5,4],
                       'C':[7,8,9,4,2,3],
                       'k1 s':[1,3,5,7,1,0],
                       'V':[5,5,5,5,5,5],
                       'k1 d':list('aaabbb')})
    
    print (df)
       A  C  V k1 d  k1 s  p5 sd
    0  a  7  5    a     1      4
    1  b  8  5    a     3      5
    2  c  9  5    a     5      4
    3  d  4  5    b     7      5
    4  e  2  5    b     1      5
    5  f  3  5    b     0      4
    

    First check if std is 0, but it remove all not numeric columns, so add them by reindex with fill_value=True:

    m1 = df.std().eq(0).reindex(df.columns, fill_value=True)
    

    Then check if contains columns names pattern:

    m2 = m1.index.str.contains("p5\s|k1\s")
    
    print (m1)
    A         True
    C        False
    V         True
    k1 d      True
    k1 s     False
    p5 sd    False
    dtype: bool
    
    print (m2)
    [False False False  True  True  True]
    

    Chain both masks together by or (|):

    m = m1 | m2
    print (m)
    A         True
    C        False
    V         True
    k1 d      True
    k1 s      True
    p5 sd     True
    dtype: bool
    

    Last filter, but is necessary invert mask by ~:

    df = df.loc[:, ~m]
    print (df)
       C
    0  7
    1  8
    2  9
    3  4
    4  2
    5  3
    

    EDIT:

    Mask for remove not numeric columns is not necessary, because std automatically removing non numeric columns:

    m0 = df.columns.isin(df.select_dtypes(include=['object']))
    m1 = df.std().eq(0).reindex(df.columns, fill_value=True)
    m2 = m1.index.str.contains("p5\s|k1\s")
    m = m1 | m2 | m0
    print (m)
    A         True
    C        False
    V         True
    k1 d      True
    k1 s      True
    p5 sd     True
    dtype: bool
    
    df = df.loc[:, ~m]
    print (df)
       C
    0  7
    1  8
    2  9
    3  4
    4  2
    5  3