Search code examples
excelpython-3.xpandasnumpydata-cleaning

Data Cleaning Python: Replacing the values of a column not within a range with NaN and then dropping the raws which contain NaN


I am doing kind of research and need to delete the raws containing some values which are not in a specific range using Python.

My Dataset in Excel:

enter image description here

  1. I want to replace the big values of column A (not within range 1-20) with NaN. Replace Big values of column B (not within range 21-40) and so on.
  2. Now I want to drop/ delete the raws contains the NaN values

Expected output should be like:

enter image description here


Solution

  • You can try this to solve your problem. Here, I tried to simulate your problem and solve it with below given code:

    import numpy as np
    import pandas as pd
    
    
    data = pd.read_csv('c.csv')
    print(data)
    data['A'] = data['A'].apply(lambda x: np.nan if x in range(1,10,1) else x)
    data['B'] = data['B'].apply(lambda x: np.nan if x in range(10,20,1) else x)
    data['C'] = data['C'].apply(lambda x: np.nan if x in range(20,30,1) else x)
    print(data)
    data = data.dropna()
    print(data)
    

    Orignal data:

        A   B   C
    0   1  10  20
    1   2  11  22
    2   4  15  25
    3   8  20  30
    4  12  25  35
    5  18  40  55
    6  20  45  60
    

    Output with NaN:

          A     B     C
    0   NaN   NaN   NaN
    1   NaN   NaN   NaN
    2   NaN   NaN   NaN
    3   NaN  20.0  30.0
    4  12.0  25.0  35.0
    5  18.0  40.0  55.0
    6  20.0  45.0  60.0
    

    Final Output:

          A     B     C
    4  12.0  25.0  35.0
    5  18.0  40.0  55.0
    6  20.0  45.0  60.0
    

    Try this for non-integer numbers:

    import numpy as np
    import pandas as pd
    
    
    data = pd.read_csv('c.csv')
    print(data)
    data['A'] = data['A'].apply(lambda x: np.nan if x in (round(y,2) for y in np.arange(1.00,10.00,0.01)) else x)
    data['B'] = data['B'].apply(lambda x: np.nan if x in (round(y,2) for y in np.arange(10.00,20.00,0.01)) else x)
    data['C'] = data['C'].apply(lambda x: np.nan if x in (round(y,2) for y in np.arange(20.00,30.00,0.01)) else x)
    print(data)
    data = data.dropna()
    print(data)
    

    Output:

           A      B      C
    0   1.25  10.56  20.11
    1   2.39  11.19  22.92
    2   4.00  15.65  25.27
    3   8.89  20.31  30.15
    4  12.15  25.91  35.64
    5  18.29  40.15  55.98
    6  20.46  45.00  60.48
           A      B      C
    0    NaN    NaN    NaN
    1    NaN    NaN    NaN
    2    NaN    NaN    NaN
    3    NaN  20.31  30.15
    4  12.15  25.91  35.64
    5  18.29  40.15  55.98
    6  20.46  45.00  60.48
           A      B      C
    4  12.15  25.91  35.64
    5  18.29  40.15  55.98
    6  20.46  45.00  60.48