
Improve running time when using inflation method on pandas


I'm trying to get real (inflation-adjusted) prices for my data in pandas. Right now I am working with just one year's worth of data (3,962,050 rows), and it took 443 seconds to inflate the values using the code below. Is there a quicker way to find the real value? Is it possible to use pooling? I have many more years, and it would take too long to wait every time.

Portion of df:
    year  quarter    fare
0   1994        1  213.98
1   1994        1  214.00
2   1994        1  214.00
3   1994        1  214.50
4   1994        1  214.50
import time

import cpi
import pandas as pd

def inflate_column(data, column):
    """
    Adjust the values in `column` of the dataframe `data` for inflation,
    using the cpi library.
    """
    print('Beginning to inflate ' + column)
    start_time = time.time()
    
    df = data.apply(lambda x: cpi.inflate(x[column], 
                      x.year), axis=1)
    
    print("Inflating process took", time.time() - start_time, " seconds to run")  
    return df

df['real_fare'] = inflate_column(df, 'fare')

Solution

  • You have many rows for each year, so you can call cpi.inflate once per year, store the result in a dict, and then use the stored value instead of calling cpi.inflate every time.

    all_years = df["year"].unique()
    dict_years = {}
    for year in all_years:
        dict_years[year] = cpi.inflate(1.0, year)
    
    # Multiply each fare by the cached factor for its year.
    df['real_fare'] = df.apply(lambda row: dict_years[row['year']] * row['fare'], axis=1)

    The last line uses apply; you could also try a vectorized form such as df['real_fare'] = df['fare'] * ...
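
One possible way to write the vectorized variant is sketched below. It assumes that cpi.inflate scales its input linearly, so that a per-year factor from cpi.inflate(1.0, year) can simply be multiplied by each fare. The helper name inflate_column_fast and its inflate and year_column parameters are hypothetical, and fake_inflate is a made-up stand-in (its factors are not real CPI data) so the example runs without downloading anything.

```python
import pandas as pd


def inflate_column_fast(data, column, inflate=None, year_column="year"):
    """Return `column` adjusted for inflation, calling inflate() once per distinct year."""
    if inflate is None:
        import cpi  # imported lazily; only needed when no function is passed in
        inflate = cpi.inflate
    # One call per distinct year instead of one per row.
    # int(year) passes a plain Python int rather than a NumPy integer.
    factors = {year: inflate(1.0, int(year)) for year in data[year_column].unique()}
    # Series.map looks up each row's factor; the multiplication is vectorized.
    return data[column] * data[year_column].map(factors)


# Made-up stand-in for cpi.inflate; these factors are NOT real CPI data.
def fake_inflate(value, year):
    return value * {1994: 2.0, 1995: 1.5}[year]


df = pd.DataFrame({
    "year": [1994, 1994, 1995],
    "quarter": [1, 1, 1],
    "fare": [213.98, 214.00, 100.0],
})
df["real_fare"] = inflate_column_fast(df, "fare", inflate=fake_inflate)
```

With real data you would call inflate_column_fast(df, 'fare') and let it fall back to cpi.inflate. The number of cpi.inflate calls then depends on the number of distinct years rather than the number of rows.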