python, pandas, eval

pandas evaluating strings as numeric


Assume df is:

import numpy as np
import pandas as pd

data = {'duration':['1week 3day 2hour 4min 23', '2hour 4min 23sec', '2hour 4min', np.nan, '', '23sec']}
df = pd.DataFrame(data)

I'm trying to calculate each duration as a total number of seconds. I replaced the units as follows:

df['duration'] = df['duration'].str.replace('week', '*604800+') \
                               .str.replace('day', '*86400+') \
                               .str.replace('hour', '*3600+') \
                               .str.replace('min', '*60+') \
                               .str.replace('sec', '') \
                               .str.replace(' ', '')

But I can't run any eval function (pd.eval, apply with eval, plain eval, etc.) on the result. Some cells end with a '+' sign, and others are empty strings or NaN. Any help?

PS: This is not a duplicate question.


Solution

  • You can use a regex combined with a custom function to convert weeks to days (1 week = 7 days) and to append a seconds unit to bare numbers (you can add other units to the dictionary). Then convert the result with pd.to_timedelta:

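    def change_units(m):
        # Map each matched unit to (factor, unit that pd.to_timedelta accepts);
        # '' is a bare number, treated as seconds.
        d = {'week': (7, 'days'), '': (1, 's')}
        _, i, period = m.groups()
        factor, txt = d[period]
        return f'{factor*int(i)}{txt}'

    # \d+ keeps multi-digit numbers whole
    df['delta'] = pd.to_timedelta(df['duration'].str.replace(r'((\d+)\s*(week|)\b)',
                                                             change_units, regex=True))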
    

    output:

                       duration            delta
    0  1week 3day 2hour 4min 23 10 days 02:04:23
    1          2hour 4min 23sec  0 days 02:04:23
    2                2hour 4min  0 days 02:04:00
    3                       NaN              NaT
    4                                        NaT
    5                     23sec  0 days 00:00:23
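
To see what the regex step does before pandas parses anything, the same callback can be run on plain strings with the standard library's re module. This is only a sketch for inspection: the `normalize` helper and the `samples` list are names made up for this illustration, and the pattern uses `\d+` so that multi-digit numbers are matched whole.

```python
import re

# Same pattern as in the answer, with \d+ so multi-digit numbers stay whole.
PATTERN = re.compile(r'((\d+)\s*(week|)\b)')

def change_units(m):
    # '' means a bare number with no unit, which is treated as seconds.
    d = {'week': (7, 'days'), '': (1, 's')}
    _, i, period = m.groups()
    factor, txt = d[period]
    return f'{factor*int(i)}{txt}'

def normalize(text):
    # Hypothetical helper: apply the callback to every match in one string.
    return PATTERN.sub(change_units, text)

samples = ['1week 3day 2hour 4min 23', '2hour 4min 23sec', '23sec', '']
normalized = [normalize(s) for s in samples]
for before, after in zip(samples, normalized):
    print(repr(before), '->', repr(after))
```

Only the week token and the trailing bare number are rewritten. Tokens such as 3day or 23sec are left alone because the `\b` cannot match between a digit and a letter.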
    

    Then you can take advantage of the resulting Timedelta values, for example to convert them to total seconds:

    pd.to_timedelta(df['duration'].str.replace(r'((\d+)\s*(week|)\b)',
                                               change_units, regex=True)
                   ).dt.total_seconds()
    

    output:

    0    871463.0
    1      7463.0
    2      7440.0
    3         NaN
    4         NaN
    5        23.0
    Name: duration, dtype: float64
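
If you would rather stay close to the arithmetic in the question without calling eval, a plain-Python parser is one possible alternative. It is a minimal sketch, not part of the answer above: `to_seconds`, `SECONDS` and `TOKEN` are hypothetical names, the unit list is limited to the spellings used in the question, and a bare number is assumed to mean seconds.

```python
import re

# Seconds per unit; the empty unit stands for a bare number (assumed seconds).
SECONDS = {'week': 604800, 'day': 86400, 'hour': 3600, 'min': 60, 'sec': 1, '': 1}

# A number followed by an optional unit; \b rejects unknown suffixes.
TOKEN = re.compile(r'(\d+)\s*(week|day|hour|min|sec|)\b')

def to_seconds(text):
    """Return the total number of seconds, or None if nothing can be parsed."""
    if not isinstance(text, str):
        return None  # e.g. NaN
    tokens = TOKEN.findall(text)
    if not tokens:
        return None  # e.g. empty string
    return sum(int(n) * SECONDS[unit] for n, unit in tokens)

print(to_seconds('1week 3day 2hour 4min 23'))
```

Applied to the column, this would be something like df['duration'].map(to_seconds). Note that tokens with an unlisted unit (for example a plural such as 3days) are skipped silently, so extend the dictionary and the pattern if your data contains them.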