Search code examples
pythonpandasdataframeconditional-formattingoutliers

How to drop part of the values from one column by condition from another column in Python, Pandas?


I have real estate dataframe with many outliers and many observations. I have variables: total area, number of rooms (if rooms = 0, then it's studio appartment) and kitchen_area.

"Minimalized" extraction from my dataframe:

dic = [{'area': 40, 'kitchen_area': 10, 'rooms': 1, 'price': 50000 },
 {'area': 20, 'kitchen_area': 0, 'rooms': 0, 'price': 50000},
 {'area': 60, 'kitchen_area': 0, 'rooms': 2, 'price': 70000},
 {'area': 29, 'kitchen_area': 9, 'rooms': 1, 'price': 30000},
 {'area': 15, 'kitchen_area': 0, 'rooms': 0, 'price': 25000}]
df = pd.DataFrame(dic, index=['apt1', 'apt2','apt3','apt4', 'apt5'])

My target would be to eliminate apt3, because by law, kitchen area cannot be smaller than 5 squared meters in non-studio apartments. In other words, I would like to eliminate all rows from my dataframe containing the data about apartments which are non-studio (rooms>0), but have kitchen_area <5

I have tried code like this:

df1 = df.drop(df[(df.rooms > 0) & (df.kitchen_area < 5)].index)

But it just eliminated all data from both columns kitchen_area and rooms according to the multiple conditions I put.


Solution

  • Clean

    mask1 = df.rooms > 0
    mask2 = df.kitchen_area < 5
    
    df1 = df[~(mask1 & mask2)]
    df1
    
          area  kitchen_area  rooms  price
    apt1    40            10      1  50000
    apt2    20             0      0  50000
    apt4    29             9      1  30000
    apt5    15             0      0  25000
    

    pd.DataFRame.query

    df1 = df.query('rooms == 0 | kitchen_area >= 5')
    df1
    
          area  kitchen_area  rooms  price
    apt1    40            10      1  50000
    apt2    20             0      0  50000
    apt4    29             9      1  30000
    apt5    15             0      0  25000