Search code examples
pythonregexpandasnullfillna

Fill missing values in selected columns with filtered values in other column


I have a weird column named null in a dataframe that contains some missing values from other columns. One column is lat-lon coordinates named location, the other is an integer representing a target variable named level. In some but not all of the cases where location or level have missing values, the values that should be there are in this null column. Here's an example df:

pd.DataFrame(
     {'null': {0: '43.70477575,-72.28844073', 1: '2', 2: '43.70637091,-72.28704334', 3: '4', 4: '3'},
     'location': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
     'level': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan}
     }
)

I need to be able to filter the null column according to whether the value is an integer or a string, and then based on that fill the missing value in the appropriate column with the appropriate value. I've tried using .apply() with a lambda funcation as well as .match(), .contains() and in inside of a for loop with no luck so far.


Solution

  • The easiest, if not the simplest approach, is to simply fill all the missing values in df.location and df.level with the values in df.null, then create a boolean filter with regex to return innappropriate/misassigned values in df.location and df.level to np.nan.

    pd.fillna()

    df = pd.DataFrame(
         {'null': {0: '43.70477575,-72.28844073', 1: '2', 2: '43.70637091,-72.28704334', 3: '4', 4: '3'},
         'location': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
         'level': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan}
         }
    )
    
    for col in ['location', 'level']:
         df[col].fillna(
              value = stress.null,
              inplace = True
         )
    

    Now we'll use string expressions to correct the mis-assigned values.

    str.contains()

    # Converting columns to type str so string methods work
    df = df.astype(str)
    
    # Using regex to change values that don't belong in column to NaN
    regex = '[,]'
    df.loc[df.level.str.contains(regex), 'level'] = np.nan
        
    regex = '^\d\.?0?$'
    df.loc[df.location.str.contains(regex), 'location'] = np.nan
        
    # Returning `df.level` to float datatype (str is the correct
    # datatype for `df.location`
    df.level.astype(float)
    

    Here's the output:

    pd.DataFrame(
         {'null': {0: '43.70477575,-72.28844073', 1: '2', 2: '43.70637091,-72.28704334', 3: '4', 4: '3'},
          'location': {0: '43.70477575,-72.28844073', 1: nan, 2: '43.70637091,-72.28704334', 3: nan, 4: nan},
          'level': {0: nan, 1: '2', 2: nan, 3: '4', 4: '3'}
         }
    )