Search code examples
pythonpython-3.xpandasindexingmask

Pandas: Conditionally fill column using a function based on other columns values


I have a Pandas DataFrame that contains two sets of coordinates (lat1, lon1, lat2, lon2). I have a function that computes distance using these coordinates. But some of the rows in the dataframe are invalid. I would like to apply my function only to valid rows and save the result of the function to a 'dist' column (the column already exists in the dataframe). I want something like this SQL:

UPDATE dataframe
SET dist=calculate_dist(lat1, lon1, lat2, lon2)
WHERE lat1 IS NOT NULL AND lat2 IS NOT NULL AND user_id>100;

How can I achieve this?

I tried using df = df.apply(calculate_dist, axis=1) but with this approach I need to process all rows, not only the rows that match my conditions, and I need to have an if statement inside the calculate_dist function that ignores invalid rows. Is there a better way?

I know that similar questions already appeared on StackOverflow but I could not find any question that utilizes both a function and conditional selection of rows.


Solution

  • I think you need filter by boolean indexing first:

    mask = (df.lat1.notnull()) & (df.lat2.notnull()) & (df.user_id>100)
    
    df['dist'] = df[mask].apply(calculate_dist, axis=1)
    

    Sample:

    df = pd.DataFrame({'lat1':[1,2,np.nan,1],
                       'lon1':[4,5,6,2],
                       'lat2':[7,np.nan,9,3],
                       'lon2':[1,3,5,1],
                       'user_id':[200,30,60,50]})
    
    print (df)
       lat1  lat2  lon1  lon2  user_id
    0   1.0   7.0     4     1      200
    1   2.0   NaN     5     3       30
    2   NaN   9.0     6     5       60
    3   1.0   3.0     2     1       50
    
    #function returning Series
    def calculate_dist(x):
        return x.lat2 - x.lat1
    
    mask = (df.lat1.notnull()) & (df.lat2.notnull()) & (df.user_id>100)
    df['dist'] = df[mask].apply(calculate_dist, axis=1)
    print (df)
       lat1  lat2  lon1  lon2  user_id  dist
    0   1.0   7.0     4     1      200   6.0
    1   2.0   NaN     5     3       30   NaN
    2   NaN   9.0     6     5       60   NaN
    3   1.0   3.0     2     1       50   NaN