Search code examples
pythonpandasvectorizationapplydatareader

Python: Pandas: Speeding up an Apply Function


I am trying to do a pandas apply function on a 33 MB dataframe (in CSV form) and it is going incredibly slow. And I am trying out figure out why. I was doing an apply on a much bigger dataframe (16 GB) and it finished in about 6 hours. This function is operating on a much, much smaller dataframe and I let it run for 1.5 hours and still nothing.

I am trying to figure out where the bottleneck could be. My suspicion is that because I am using the datareader function which pings yahoo or google finance that could be why it is so much slower. But when I do a sample with it for a few stocks it seems pretty instantaneous.

Does anyone have any thoughts on this? Or ways to make it faster? I have considered cythoning it but if the bottleneck is the ping time that wont speed things up too much. Or better yet is there a way to vectorize this? (I can't see a way but some people are much smarter than I am :) ) That's a lot of questions but basically I am just looking for suggestions to make this run faster. Thanks!

PS- also if someone knows how to add a progress bar on an apply function that would be a great added bonus :) Thanks again!

data4=pd.read_csv('check2.csv', parse_dates=['dater1','dater2'], infer_datetime_format=True)

def nextweekday(date):
    day=date.weekday()
    if day==4:
        return date+datetime.timedelta(days=3)
    if day==5:
        return date+datetime.timedelta(days=2)
    else:
        return date+datetime.timedelta(days=1)

def getquote(tick,date,plus):
    date=date+datetime.timedelta(days=plus)
    nextday=nextweekday(date)
    try:
        return DataReader(tick, "yahoo",date, nextday)["Close"]
    except:
        return "NO"

def apply_days5(row):
    return getquote(row['AcquirorTickerSymbol'],row['dater2'],5)

data4['days5']=data4.apply(apply_days5, axis=1)

Solution

  • I'm not 100% sure what you're trying to accomplish, but a few ideas.

    First, repeatedly hitting yahoo like that adds a lot unnecessary overhead. I'd probably do something like this, reading all the stock data into one dataframe.

    In [83]: tickers = data4['AcquirorTickerSymbol'].unique()
    
    In [84]: min_date = data4['dater2'].min()
        ...: max_date = data4['dater2'].max()
        ...:     
        ...: dfs = []
        ...: for ticker in tickers:
        ...:     df = DataReader(ticker, 'yahoo', min_date, max_date)[['Close']]
        ...:     df['AcquirorTickerSymbol'] = ticker
        ...:     df['dater2'] = df.index
        ...:     dfs.append(df)
    
    In [85]: stock_df = pd.concat(dfs, ignore_index=True)
    

    Then, rather than using apply, you could merge your existing data against the stock df, something like this:

    In [92]: data4 = data4.merge(stock_df, how='left')
    

    If you want to fill missing values, rather than having custom logic in apply, it's much faster use fillna

    In [94]: data4['Close'] = data4['Close'].fillna('NO')