
Elegant way to get range of values from two columns using pandas


I have a dataframe built as shown below (the full code to reproduce it follows):

import pandas as pd

df1 = pd.DataFrame({'person_id': [11,21,31,41,51],
                        'date_birth': ['05/29/1967', '01/21/1957', '7/27/1959','01/01/1961','12/31/1961']})
df1 = df1.melt('person_id', value_name='date_birth')
df1['birth_dates'] = pd.to_datetime(df1['date_birth'])
df_ranges = df1.assign(until_prev_year_days=(df1['birth_dates'].dt.dayofyear - 1),
     until_next_year_days=((df1['birth_dates'] + pd.offsets.YearEnd(0)) - df1['birth_dates']).dt.days)
f = {'until_prev_year_days': 'min', 'until_next_year_days': 'min'}

min_days = df_ranges.groupby('person_id',as_index=False).agg(f)
min_days.columns = ['person_id','no_days_to_prev_year','no_days_to_next_year']
df_offset = pd.merge(df_ranges[['person_id','birth_dates']], min_days, on='person_id',how='inner')
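As a quick check of what the two expressions inside assign compute, the same arithmetic can be run on a single date. This is only a sketch: the variable names are illustrative and not part of the code above, and the date is the first birth date from the sample data.

```python
import pandas as pd

# One sample birth date taken from the data above.
birth = pd.Timestamp("1967-05-29")

# Days elapsed since 1 January of the same year (dayofyear is 1-based).
days_since_year_start = birth.dayofyear - 1

# YearEnd(0) rolls forward to 31 December of the same year
# (and leaves the date unchanged if it is already 31 December).
year_end = birth + pd.offsets.YearEnd(0)
days_until_year_end = (year_end - birth).days

print(days_since_year_start, days_until_year_end)  # 148 216
```

These two values match the no_days_to_prev_year and no_days_to_next_year entries for person_id 11.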

Here is what I tried in order to get the range:

df_offset['range_to_shift'] = "[" + (-1 * df_offset['no_days_to_prev_year']).map(str) + "," + df_offset['no_days_to_next_year'].map(str) + "]"

Though my approach works, I would like to know whether there is a better, more elegant way to do the same.

Please note that the values from no_days_to_prev_year have to be prefixed with a minus sign.

I expect each row of range_to_shift to hold the negated no_days_to_prev_year followed by no_days_to_next_year, for example [-148, 216] for person_id 11.


Solution

  • Use DataFrame.mul to flip the sign of the first column, then DataFrame.to_numpy and tolist to collect each row into a list:

    cols = ['no_days_to_prev_year', 'no_days_to_next_year']
    df_offset['range_to_shift'] = df_offset[cols].mul([-1, 1]).to_numpy().tolist()
    

    Result:

    # print(df_offset)
    
       person_id birth_dates  no_days_to_prev_year  no_days_to_next_year range_to_shift
    0         11  1967-05-29                   148                   216    [-148, 216]
    1         21  1957-01-21                    20                   344     [-20, 344]
    2         31  1959-07-27                   207                   157    [-207, 157]
    3         41  1961-01-01                     0                   364       [0, 364]
    4         51  1961-12-31                   364                     0      [-364, 0]
    

    timeit performance results:

    df_offset.shape
    (50000, 5)
    
    %%timeit -n100
    cols = ['no_days_to_prev_year', 'no_days_to_next_year']
    df_offset['range_to_shift'] = df_offset[cols].mul([-1, 1]).to_numpy().tolist()
    
    15.5 ms ± 464 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
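The one-liner above can be tried without building df_offset first. The following is a minimal sketch, assuming a hypothetical stand-in frame (df_demo) filled with the day counts from the result table, plus a rough timing on randomly generated data using the standard-library timeit module instead of the IPython magic. The random frame assumes non-leap years (364 days in total), and the timing will differ from the figure quoted above depending on hardware and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

cols = ["no_days_to_prev_year", "no_days_to_next_year"]

# Hypothetical stand-in for df_offset, using the day counts from the result table.
df_demo = pd.DataFrame({
    "person_id": [11, 21, 31, 41, 51],
    "no_days_to_prev_year": [148, 20, 207, 0, 364],
    "no_days_to_next_year": [216, 344, 157, 364, 0],
})

# mul broadcasts the list across the columns: first column * -1, second * 1.
signed = df_demo[cols].mul([-1, 1])

# to_numpy() yields an (n, 2) array; tolist() turns it into n two-element lists.
df_demo["range_to_shift"] = signed.to_numpy().tolist()
print(df_demo["range_to_shift"].tolist())
# [[-148, 216], [-20, 344], [-207, 157], [0, 364], [-364, 0]]

# Rough timing on 50,000 random rows (assumption: non-leap years only).
rng = np.random.default_rng(0)
prev_days = rng.integers(0, 365, size=50_000)
df_bench = pd.DataFrame({
    "no_days_to_prev_year": prev_days,
    "no_days_to_next_year": 364 - prev_days,
})


def build_ranges():
    """Build the list-of-pairs column values for df_bench."""
    return df_bench[cols].mul([-1, 1]).to_numpy().tolist()


timings = timeit.repeat(build_ranges, number=10, repeat=3)
print(f"best of 3: {min(timings) / 10 * 1000:.1f} ms per call")
```

Note that this stores Python lists in range_to_shift, whereas the string concatenation in the question produces strings such as "[-148,216]". Which one is preferable depends on how the column is used afterwards.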