Search code examples
pythonpandasdataframedata-masking

Pandas data masking where the conditions come from other variables


I have a dataframe and two lists as below:

seller1 = [5, 4, 3]
seller2 = [4, 2, 1] 

df = {'customer': [1, 1, 1, 2, 2, 2], 'time': [1,2,3,1,2,3], 'location': [3,4,2,4,3,3], 'demand':[10,12,15,20,8,16], 'price':[3,4,4,5,2,1]}
df = pd.DataFrame(df)

Which results in the following table:

   customer  time    location  demand price   
0   1          1        3        10     3   
1   1          2        4        12     4
2   1          3        2        15     4            
3   2          1        4        20     5 
4   2          2        3         8     2 
5   2          3        3        16     1 

The seller1 and seller2 lists show where the sellers are at time 1,2, and 3. I want to know the demand and the price if one of the sellers is there at the exact time and mask the demand data otherwise. For example, at time 1, seller one is at location 5 and seller 2 is at location 4. Likewise, customer 1 is at location 3 and customer 2 is at location 4. So, the sellers meet the first customer but not the second at t=1.

The end table I want to have is

   customer  time    location  demand  price   
0   1          1        3        None    None   
1   1          2        4        12      4
2   1          3        2        None    None            
3   2          1        4        20      5 
4   2          2        3        None    None 
5   2          3        3        16      1 

So far, I have

for i in range(df.shape[0]):
    if df["location"][i] != seller1[int(df["time"][i])-1] and df["location"][i] != seller2[int(df["time"][i])-1]:
        df["demand"][i] = np.nan
        df["price"][i] = np.nan

This is producing a SettingWithCopyWarning: and it doesn't look efficient with the for loop, either.

Is there a way to do this with df.mask()?


Solution

  • Annotated Code

    # Create tuples corresponding to time and location for each seller
    c = ['time', 'location']
    sellers = [(i, x) 
               for s in (seller1, seller2) 
               for i, x in enumerate(s, 1)]
    
    # Identify the rows where tuple pairs match with
    # the time and location from the given dataframe
    mask = df.set_index(c).index.isin(sellers)
    
    # Mask the rows where codition doesn;t hold true
    c = ['demand', 'price']
    df.loc[~mask, c] = np.nan
    

    Result

       customer  time  location  demand  price
    0         1     1         3     NaN    NaN
    1         1     2         4    12.0    4.0
    2         1     3         2     NaN    NaN
    3         2     1         4    20.0    5.0
    4         2     2         3     NaN    NaN
    5         2     3         3    16.0    1.0