
Replace empty values with part of the text from another variable in a pandas DataFrame, using a filter and a regular expression


I want to replace empty values with part of the text I find in another variable in pandas. To achieve this, I need to use a regular expression to extract the exact text I want transferred, and also apply a filter so that only rows that have no value to begin with are changed.

In SAS this is straightforward, but I am struggling to do the same in Python/pandas.

The example below is a simplified version of my problem. Specifically, I need to replace any empty values in the variable Mount with the part of the text in the variable Lens that follows the word "til" (which means "for" in English). In this example, that is the word "Canon" in the second row. If Mount is not missing for a particular row, nothing happens (as can be seen in the first row).

I have come up with the solution below, which sort of works, but I feel there is a more efficient way to do it. In particular, the temporary variable Mount_tmp seems unnecessary. Any thoughts and ideas to improve my code would be appreciated. Thanks.

import numpy as np
import pandas as pd

data = {'Lens': ['Canon EF 50mm f/1.8 STM', 'Zeiss Planar T* 85mm f/1.4 til Canon'],
        'Mount': ['Canon E', np.nan]}

frame = pd.DataFrame(data)

# Generate temporary variable holding the text that follows "til"
frame['Mount_tmp'] = frame['Lens'].str.extract(r'til (\w+\s*\w*)')

# Replace empty data in variable Mount with existing data from Mount_tmp
filt = frame['Mount'].isnull()
frame.loc[filt, 'Mount'] = frame.loc[filt, 'Mount_tmp']
frame.drop('Mount_tmp', axis=1, inplace=True)

Solution

  • Try:

    mask = frame.Mount.isna()
    frame.loc[mask, "Mount"] = frame.loc[mask, "Lens"].str.extract(r"til\s+(.*)")[0]
    print(frame)
    

    Prints:

                                       Lens    Mount
    0               Canon EF 50mm f/1.8 STM  Canon E
    1  Zeiss Planar T* 85mm f/1.4 til Canon    Canon
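
    A possible variant without the explicit mask, offered as a sketch rather than the definitive approach: Series.fillna accepts another Series and only fills the rows where Mount is missing, so neither a temporary column nor a filter is needed. The name `extracted` is just an illustrative helper. The pattern is the one from the answer above, which captures everything after "til", not the narrower `\w+\s*\w*` from the question.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "Lens": ["Canon EF 50mm f/1.8 STM", "Zeiss Planar T* 85mm f/1.4 til Canon"],
    "Mount": ["Canon E", np.nan],
})

# expand=False makes str.extract return a Series for a single capture group;
# rows whose Lens text does not contain "til" yield NaN
extracted = frame["Lens"].str.extract(r"til\s+(.*)", expand=False)

# fillna aligns on the index and only touches rows where Mount is missing
frame["Mount"] = frame["Mount"].fillna(extracted)
print(frame)
```

    Rows where Mount is missing and Lens contains no "til" simply stay missing, since there is nothing to fill them with.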