Search code examples
pythonpandasdataframemulti-index

Efficient and performant way to calculate the change percentage columns


I am using yfinance via pandas datareader to download multiple-symbols' multi-year data and am trying to calculate 'MTDChg', 'YTDChg' and I figure this is one of the slowest parts in the runtime.

Here is the snippet of the code, where I have reservations about picking of the end of the previous period vis-a-viz availability of the data in the index itself. It is a DataFrame with multiple columns.

I am curious and trying to figure out if there is a better way around this. Using asfreq looks appealing but I am afraid that I will not be able to use the actual reference for starting or ending periods, for which the data may or may not exist in the index. I am thinking of using a applymap but not really sure of how to go about it in terms of a functional code for it and if that would be better in terms of performance.

Any ideas or suggestion on how do I go about this?

    import yfinance as yf
    import pandas_datareader as pdr
    import datetime as dt
    from pandas_datareader import data as pdr
    yf.pdr_override()
 
    y_symbols = ['GOOG', 'MSFT', 'TSLA']
    price_feed = pdr.get_data_yahoo(y_symbols, 
                                      start = dt.datetime(2020,1,1),
                                      end = dt.datetime(2022,12,1),
                                      interval = "1d")

    for dt in price_feed.index:
        dt_str = dt.strftime("%Y-%m-%d")
        current_month_str = f"{dt.year}-{dt.month}"
        previous_month_str = f"{dt.year}-{dt.month - 1}"
        current_year_str = f"{dt.year}"
        previous_year_str = f"{dt.year - 1}"
        
        
        if previous_month_str in price_feed.index:
            previous_month_last_day = price_feed.loc[previous_month_str].index[-1].strftime("%Y-%m-%d")
        else:
            previous_month_last_day = price_feed.loc[current_month_str].index[0].strftime("%Y-%m-%d")
            
            
        if previous_year_str in price_feed.index:
            previous_year_last_day = price_feed.loc[previous_year_str].index[-1].strftime("%Y-%m-%d")
        else:
            previous_year_last_day = price_feed.loc[current_year_str].index[0].strftime("%Y-%m-%d")
            
            
        if dt.month == 1 or dt.month == 2 or dt.month == 3:
            previous_qtr_str = f"{dt.year - 1}-12"
            current_qtr_str  = f"{dt.year}-01"
        elif dt.month == 4 or dt.month == 5 or dt.month == 6:
            previous_qtr_str = f"{dt.year}-03"
            current_qtr_str  = f"{dt.year}-04"
        elif dt.month == 7 or dt.month == 8 or dt.month == 9:
            previous_qtr_str = f"{dt.year}-06"
            current_qtr_str  = f"{dt.year}-07"
        elif dt.month == 10 or dt.month == 11 or dt.month == 12:
            previous_qtr_str = f"{dt.year}-09"
            current_qtr_str  = f"{dt.year}-10"
        else:
            previous_qtr_str = f"{dt.year}-09"
            current_qtr_str  = f"{dt.year}-10"
                    
        if previous_qtr_str in price_feed.index:
            #print("Previous quarter string is present in price feed for ", dt_str)
            previous_qtr_last_day = price_feed.loc[previous_qtr_str].index[-1].strftime("%Y-%m-%d")
            #print("Last quarter last day is", previous_qtr_last_day)
        elif current_qtr_str in price_feed.index:
            previous_qtr_last_day = price_feed.loc[current_qtr_str].index[0].strftime("%Y-%m-%d")
            #print("Previous quarter is not present in price feed")
            #print("Last quarter last day is", previous_qtr_last_day)
        else:
            previous_qtr_last_day = price_feed.loc[current_month_str].index[0].strftime("%Y-%m-%d")
            #print("Previous quarter string is NOT present in price feed")
            #print("Last quarter last day is", previous_qtr_last_day)
            
        #print(dt.day, current_month_str, previous_month_last_day)
        for symbol in y_symbols:
            #print(symbol, dt.day, previous_month_last_day, "<--->", pivot_calculations.loc[dt, ('Close', symbol)],  pivot_calculations.loc[previous_month_last_day, ('Close', symbol)])
            mtd_perf = (pivot_calculations.loc[dt, ('Close', symbol)] - pivot_calculations.loc[previous_month_last_day, ('Close', symbol)]) / pivot_calculations.loc[previous_month_last_day, ('Close', symbol)] * 100
            pivot_calculations.loc[dt_str, ('MTDChg', symbol)] = round(mtd_perf, 2)
            # calculate the qtd performance values
            qtd_perf = (pivot_calculations.loc[dt, ('Close', symbol)] - pivot_calculations.loc[previous_qtr_last_day, ('Close', symbol)]) / pivot_calculations.loc[previous_qtr_last_day, ('Close', symbol)] * 100
            pivot_calculations.loc[dt_str, ('QTDChg', symbol)] = round(qtd_perf, 2)
            ytd_perf = (pivot_calculations.loc[dt, ('Close', symbol)] - pivot_calculations.loc[previous_year_last_day, ('Close', symbol)]) / pivot_calculations.loc[previous_year_last_day, ('Close', symbol)] * 100
            pivot_calculations.loc[dt_str, ('YTDChg', symbol)] = round(qtd_perf, 2)```

Solution

  • IIUC, you are looking at period-to-date, for example "month-to-date" percent change calculations. The method below does the calculations for all three period types in 11.6 ms.

    First: definition

    The formal definition of "month to date" is:

    A period starting at the beginning of the current calendar month and ending at the current date. Month-to-date is used in many contexts, mainly for recording results of an activity in the time between a date (exclusive, since this day may not yet be "complete") and the beginning of the current month.

    More usual, however, and consistent with the calculation you seem to want, is: "from the Close of the previous period to the Close of current date (inclusive)". For example, the (usual, not formal) MTD change for '2020-01-07' would be the change between close('2019-12-31') and close('2020-01-07').

    You introduce a twist that I think is a bit dangerous: if the basis (last of the previous period) isn't present in the data, then use the first day of the current period (and I presume you'd prefer to use Open than Close as basis for that initial period). I think it is safer and more correct to extend your data fetching to a few days earlier, and, after calculation, discard these extra days (see "Addendum" below).

    Anyway, here is a way to do what I think you are asking for. We first calculate a basis for the desired period. For example, for "month-to-date", that basis would be:

    basis = price_feed['Close'].shift().fillna(
        price_feed['Open']).groupby(pd.Grouper(freq='M')).transform('first')
    

    To verify that basis:

    >>> basis.loc['2020-01-30':'2020-02-04']
                     GOOG        MSFT       TSLA
    Date                                        
    2020-01-30  67.077499  158.779999  28.299999
    2020-01-31  67.077499  158.779999  28.299999
    2020-02-03  71.711502  170.229996  43.371334
    2020-02-04  71.711502  170.229996  43.371334
    

    Observe that each day of the month has the Close of the previous month, when available. For the first month (where the previous month is not available), we used the current month's Open.

    Now, the percentage change is simply:

    >>> 100 * (price_feed['Close'] - basis) / basis
                    GOOG      MSFT       TSLA
    Date                                     
    2020-01-02  1.924640  1.158834   1.356893
    2020-01-03  1.424468 -0.100771   4.360428
    2020-01-06  3.925315  0.157451   6.369850
    ...              ...       ...        ...
    2022-11-28  1.679692  4.148533 -19.609737
    2022-11-29  0.824000  3.532502 -20.528256
    2022-11-30  7.173033  9.912546 -14.432626
    

    Putting it all together, for all periods of interest:

    gb = price_feed['Close'].shift().fillna(price_feed['Open']).groupby
    out = {
        name: 100 * (price_feed['Close'] - basis) / basis
        for name, freq in [
            ('MTD', 'M'),
            ('QTD', 'Q'),
            ('YTD', 'Y')
        ]
        for basis in [gb(pd.Grouper(freq=freq)).transform('first')]
    }
    
    >>> out['YTD']
                     GOOG       MSFT       TSLA
    Date                                       
    2020-01-02   1.924640   1.158834   1.356893
    2020-01-03   1.424468  -0.100771   4.360428
    2020-01-06   3.925315   0.157451   6.369850
    ...               ...        ...        ...
    2022-11-28 -33.473645 -28.116083 -48.072448
    2022-11-29 -34.033502 -28.541271 -48.665759
    2022-11-30 -29.879496 -24.137728 -44.728328
    

    Addendum: safer way (load some extra days before start)

    As said above, it is safer (more correct) to load a few days earlier, then truncate:

    y_symbols = ['GOOG', 'MSFT', 'TSLA']
    s, e = pd.Timestamp('2020-01-01'), pd.Timestamp('2022-12-01')
    price_feed = pdr.get_data_yahoo(y_symbols, start=s - pd.Timedelta('7 days'), end=e, interval='1d')
    
    def pctchg(price_feed, s, periods=(('MTD', 'M'), ('QTD', 'Q'), ('YTD', 'Y'))):
        gb = price_feed['Close'].shift().truncate(before=s).groupby
        return {
            name: 100 * (price_feed['Close'] - basis).dropna() / basis
            for name, freq in periods
            for basis in [gb(pd.Grouper(freq=freq)).transform('first')]
        }
    
    >>> pctchg(price_feed, s)['YTD']
                     GOOG       MSFT       TSLA
    Date                                       
    2020-01-02   2.269976   1.851616   2.851818
    2020-01-03   1.768110   0.583385   5.899652
    2020-01-06   4.277430   0.843375   7.938711
    ...               ...        ...        ...
    2022-11-28 -33.473645 -28.116083 -48.072448
    2022-11-29 -34.033502 -28.541271 -48.665759
    2022-11-30 -29.879496 -24.137728 -44.728328
    

    Addendum 2: performance

    All the calculations are vectorized, so we expect this to be fast. Let's check this (for the "safer" version above):

    %timeit pctchg(price_feed, s)
    # 11.6 ms ± 52.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)