Assume a DataFrame df defined as:
import numpy as np
import pandas as pd
data = {'duration':['1week 3day 2hour 4min 23', '2hour 4min 23sec', '2hour 4min', np.nan, '', '23sec']}
df = pd.DataFrame(data)
I'm trying to calculate each duration as a total number of seconds. I replaced the units with multiplication factors:
df['duration'] = df['duration'].str.replace('week', '*604800+') \
    .str.replace('day', '*86400+') \
    .str.replace('hour', '*3600+') \
    .str.replace('min', '*60+') \
    .str.replace('sec', '') \
    .str.replace(' ', '')
But I can't evaluate the resulting strings (with pd.eval, apply with eval, plain eval, etc.): some cells end with a '+' sign, and others are empty strings or NaN. Any help?
PS: This is not a duplicate question.
You can use a regex combined with a custom function to replace weeks with 7 days and to append a seconds unit to bare numbers (you can add other units). Then convert the result with pd.to_timedelta:
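def change_units(m):
    # map each matched unit to (multiplier, unit name that pd.to_timedelta parses)
    d = {'week': (7, 'days'), '': (1, 's')}
    _, i, period = m.groups()
    factor, txt = d[period]
    return f'{factor*int(i)}{txt}'

# \d+ captures the whole number, so multi-digit counts are converted correctly
df['delta'] = pd.to_timedelta(df['duration'].str.replace(r'((\d+)\s*(week|)\b)',
                                                         change_units, regex=True))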
output:
                   duration             delta
0  1week 3day 2hour 4min 23  10 days 02:04:23
1          2hour 4min 23sec   0 days 02:04:23
2                2hour 4min   0 days 02:04:00
3                       NaN               NaT
4                                         NaT
5                     23sec   0 days 00:00:23
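To see what the callback does before pd.to_timedelta parses the result, it can be tried on single strings with re.sub from the standard library. This is only a sketch: it assumes the pattern uses `\d+` so that multi-digit counts are captured whole, and the sample strings are the ones from the question.

```python
import re

# A number followed by "week", or a bare number. Numbers attached to
# other units (day, hour, min, sec) fail the \b check and are left alone.
PATTERN = re.compile(r'((\d+)\s*(week|)\b)')

def change_units(m):
    # week -> 7 days, bare number -> seconds
    d = {'week': (7, 'days'), '': (1, 's')}
    _, i, period = m.groups()
    factor, txt = d[period]
    return f'{factor*int(i)}{txt}'

samples = ['1week 3day 2hour 4min 23', '2hour 4min 23sec', '23sec', '']
rewritten = [PATTERN.sub(change_units, s) for s in samples]
for before, after in zip(samples, rewritten):
    print(repr(before), '->', repr(after))
# '1week 3day 2hour 4min 23' -> '7days 3day 2hour 4min 23s'
# '2hour 4min 23sec' -> '2hour 4min 23sec'
# '23sec' -> '23sec'
# '' -> ''
```

Only the week count and the trailing bare number are rewritten; everything else is already in a form that the timedelta parser accepts.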
You can then take advantage of the resulting Timedelta values, for example to convert them to seconds with .dt.total_seconds():
pd.to_timedelta(df['duration'].str.replace(r'((\d+)\s*(week|)\b)',
                                           change_units, regex=True)
               ).dt.total_seconds()
output:
0    871463.0
1      7463.0
2      7440.0
3         NaN
4         NaN
5        23.0
Name: duration, dtype: float64
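If you would rather keep the multiplication idea from the question without calling eval, the sum can also be computed directly from the regex matches. The following is a minimal sketch, not part of the approach above: `to_seconds`, `UNIT_SECONDS` and `TOKEN` are hypothetical names, and it assumes that a bare number means seconds and that missing or empty cells should give None.

```python
import re
from datetime import timedelta

# Seconds per unit; the empty unit (a bare number) is assumed to mean seconds.
UNIT_SECONDS = {'week': 604800, 'day': 86400, 'hour': 3600,
                'min': 60, 'sec': 1, '': 1}
TOKEN = re.compile(r'(\d+)\s*(week|day|hour|min|sec|)\b')

def to_seconds(text):
    """Return the total number of seconds, or None for missing/empty input."""
    if not isinstance(text, str) or not text.strip():
        return None
    return sum(int(n) * UNIT_SECONDS[unit] for n, unit in TOKEN.findall(text))

values = ['1week 3day 2hour 4min 23', '2hour 4min 23sec', '2hour 4min',
          float('nan'), '', '23sec']
results = [to_seconds(v) for v in values]
print(results)  # [871463, 7463, 7440, None, None, 23]

# Cross-check of the first row against the standard library
check = timedelta(weeks=1, days=3, hours=2, minutes=4, seconds=23).total_seconds()
print(check)  # 871463.0
```

On the DataFrame this could be applied with df['duration'].map(to_seconds). The to_timedelta route is still the better choice if you also want real Timedelta values rather than plain numbers.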