Pandas conditional, contains, replace


I am hoping someone can help me with this.

I have a DataFrame in which I want to run replace on the 'Length' column, but only on the rows that meet a criterion (which I have been able to do). The problem is that the rows that do not meet the criterion end up as NaN, whereas I want to keep the existing data in those rows, i.e. not run the replace method on them.

Here is the code that makes the selection. The criterion is that the value does not contain 'hour'.

mask = ~data['Length'].str.contains("hour") 

This is my replacement code, which works fine for the selected rows. I am writing over the original data.

data['Length'] = data.loc[mask, 'Length'].replace(r'([1])[h]\s', r'\1 hour ', regex=True)
data['Length'] = data.loc[mask, 'Length'].replace(r'([2-9]*)[h]\s', r'\1 hours ', regex=True)
data['Length'] = data.loc[mask, 'Length'].replace(r'([0-9]*)[m]', r'\1 minutes', regex=True)

Input data. Row 0 already has the format I require. The other rows need to be run through the replace calls.

0    1 hour 30 minutes
1               2h 45m
2               4h 30m
3               1h 45m
4               1h 45m
5               1h 45m
6               1h 45m
7               1h 45m

Current output. I want to keep the original data in row 0.

0                   NaN
1    2 hours 45 minutes
2    4 hours 30 minutes
3     1 hour 45 minutes
4     1 hour 45 minutes
5     1 hour 45 minutes
6     1 hour 45 minutes
7     1 hour 45 minutes

I am not married to the methods I have used, so if there is a better way I am open to suggestions, but hopefully there is something simple I am missing. It seems like it should be simple, as it's just an if-else. Thanks.


Solution

  • You must apply the mask on the left-hand side of the assignment as well:

    # Add ".loc[mask, " on the left side:
    data.loc[mask, 'Length'] = data.loc[mask, 'Length'].replace(r'([1])[h]\s', r'\1 hour ', regex=True)
    data.loc[mask, 'Length'] = data.loc[mask, 'Length'].replace(r'([2-9]*)[h]\s', r'\1 hours ', regex=True)
    data.loc[mask, 'Length'] = data.loc[mask, 'Length'].replace(r'([0-9]*)[m]', r'\1 minutes', regex=True)
    

    Otherwise you are overwriting the whole data['Length'] column with a filtered (shorter) Series. pandas aligns the assignment on the index, so the rows that are missing from the filtered Series are filled with np.nan.
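
To see both behaviours side by side, here is a minimal, self-contained sketch. The sample DataFrame is an assumption built from the first few rows of the question's input, and the names `broken` and `result` are only illustrative. It first reproduces the NaN in row 0 on a copy, then applies the masked assignment from the answer.

```python
import pandas as pd

# Assumed sample data, taken from the first rows of the question's input.
data = pd.DataFrame({"Length": ["1 hour 30 minutes", "2h 45m", "4h 30m", "1h 45m"]})

# True for rows that do not yet contain "hour", i.e. the rows to rewrite.
mask = ~data["Length"].str.contains("hour")

# Reproduce the problem on a copy: the right-hand side only holds the masked
# rows, and whole-column assignment aligns on the index, so row 0 becomes NaN.
broken = data.copy()
broken["Length"] = broken.loc[mask, "Length"].replace(r"([1])[h]\s", r"\1 hour ", regex=True)

# The fix: select the same rows on both sides, so unmasked rows are left alone.
data.loc[mask, "Length"] = data.loc[mask, "Length"].replace(r"([1])[h]\s", r"\1 hour ", regex=True)
data.loc[mask, "Length"] = data.loc[mask, "Length"].replace(r"([2-9]*)[h]\s", r"\1 hours ", regex=True)
data.loc[mask, "Length"] = data.loc[mask, "Length"].replace(r"([0-9]*)[m]", r"\1 minutes", regex=True)

result = data["Length"].tolist()
print(broken["Length"].isna().tolist())
print(result)
```

Note that `mask` is computed once, before any replacement. Recomputing it after the first replace would drop the rows that have just gained the word "hour", and their minutes would never be expanded.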