Search code examples
python-3.xpandasquartile

Replace values on dataset and apply quartile rule by row on pandas


I have a dataset with lots of variables. So I've extracted the numeric ones:

numeric_columns = transposed_df.select_dtypes(np.number)

Then I want to replace all 0 values for 0.0001

transposed_df[numeric_columns.columns] = numeric_columns.where(numeric_columns.eq(0, axis=0), 0.0001)

And here is the first problem. This line is not replacing the 0 values with 0.0001, but is replacing all non zero values with 0.0001.

Also after this (replacing the 0 values by 0.0001) I want to replace all values there are less than the 1th quartile of the row to -1 and leave the others as they were. But I am not managing how.


Solution

  • Use DataFrame.mask and for second condition compare by DataFrame.quantile:

    transposed_df = pd.DataFrame({
            'A':list('abcdef'),
             'B':[0,0.5,4,5,5,4],
             'C':[7,8,9,4,2,3],
             'D':[1,3,0,7,1,0],
             'E':[5,3,6,9,2,4],
             'F':list('aaabbb')
    })
    numeric_columns = transposed_df.select_dtypes(np.number)
    m1 = numeric_columns.eq(0)
    m2 = numeric_columns.lt(numeric_columns.quantile(q=0.25, axis=1), axis=0)
    transposed_df[numeric_columns.columns] = numeric_columns.mask(m1, 0.0001).mask(m2, -1)
    
    print (transposed_df)
       A    B  C    D  E  F
    0  a -1.0  7  1.0  5  a
    1  b -1.0  8  3.0  3  a
    2  c  4.0  9 -1.0  6  a
    3  d  5.0 -1  7.0  9  b
    4  e  5.0  2 -1.0  2  b
    5  f  4.0  3 -1.0  4  b
    

    EDIT:

    from  scipy.stats import zscore
    print (transposed_df[numeric_columns.columns].apply(zscore))
              B         C         D         E
    0 -2.236068  0.570352 -0.408248  0.073521
    1  0.447214  0.950586  0.408248 -0.808736
    2  0.447214  1.330821 -0.816497  0.514650
    3  0.447214 -0.570352  2.041241  1.838037
    4  0.447214 -1.330821 -0.408248 -1.249865
    5  0.447214 -0.950586 -0.816497 -0.367607
    

    EDIT1:

    transposed_df = pd.DataFrame({
            'A':list('abcdef'),
             'B':[0,1,1,1,1,1],
             'C':[1,8,9,4,2,3],
             'D':[1,3,0,7,1,0],
             'E':[1,3,6,9,2,4],
             'F':list('aaabbb')
    })
    numeric_columns = transposed_df.select_dtypes(np.number)
    
    from  scipy.stats import zscore
    
    df1 = pd.DataFrame(numeric_columns.apply(zscore, axis=1).tolist(),index=transposed_df.index)
    transposed_df[numeric_columns.columns] = df1
    print (transposed_df)
       A         B         C         D         E  F
    0  a -1.732051  0.577350  0.577350  0.577350  a
    1  b -1.063410  1.643452 -0.290021 -0.290021  a
    2  c -0.816497  1.360828 -1.088662  0.544331  a
    3  d -1.402136 -0.412393  0.577350  1.237179  b
    4  e -1.000000  1.000000 -1.000000  1.000000  b
    5  f -0.632456  0.632456 -1.264911  1.264911  b