Tags: python-3.x, dataframe, python-multiprocessing, pandas-datareader

Multiprocessing or multithreading in Python to download files


I have a CSV file containing the list of symbols I wish to pull from the provider (about 6,000 of them). It takes almost 3 hours to download the whole symbol list and save it to CSV; each symbol takes about 3-4 seconds to download.

I'm wondering, would it be possible / quicker to use multiprocessing / multithreading to speed this process up?

What would be the correct way to apply multiprocessing or multithreading to speed up this process?

def f():
    for ticker in tickers:
        df = get_eod_data(ticker, ex, api_key='xxxxxxxxxxxxxxxxxxx')
        df.columns = ['Open','High','Low','Close','Adj close','Volume']
        df.to_csv('Path\\to\\file\\{}.csv'.format(ticker))


p = Pool(20)
p.map(f)

Thanks!!


Solution

  • After a little research, I think this is the best way to go:

    import multiprocessing

    # get_eod_data and ex come from the data provider, as in the question above
    x = ['1','2','3','4','5','6', ..... '3000']   # list of symbols to download

    def f(ticker):
        df = get_eod_data(ticker, ex, api_key='xxxxxxxxxxxxxxxxxxx')
        df.columns = ['Open','High','Low','Close','Adj close','Volume']
        df.to_csv('Path\\to\\file\\{}.csv'.format(ticker))

    def mp_handler_1():
        p1 = multiprocessing.Pool(10)   # 10 worker processes
        p1.map(f, x)                    # each symbol in x is passed to f in parallel

    if __name__ == '__main__':
        mp_handler_1()
    

    From the original 3-4 hours it took to download all the symbols, using multiprocessing.Pool brought it down to 35-40 minutes! It created 10 Python processes and ran the function in parallel, with no data loss or corruption. The only downside: if this requires more memory than is available, you will get a MemoryError.
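
    Since the work here is network-bound rather than CPU-bound, a thread pool is a reasonable alternative that keeps everything in a single process and so avoids the MemoryError risk mentioned above. This is not part of the original answer, just a minimal sketch assuming the same get_eod_data, ex, API key and symbol list x as in the code above:

        import concurrent.futures

        # Same per-symbol download as above; get_eod_data, ex and the API key
        # are assumed to come from the data provider, as in the question.
        def download(ticker):
            df = get_eod_data(ticker, ex, api_key='xxxxxxxxxxxxxxxxxxx')
            df.columns = ['Open','High','Low','Close','Adj close','Volume']
            df.to_csv('Path\\to\\file\\{}.csv'.format(ticker))
            return ticker

        def mt_handler():
            # Threads share one interpreter process, so memory overhead stays low
            # while the network requests still run concurrently.
            with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
                for ticker in executor.map(download, x):
                    print('Finished', ticker)

        if __name__ == '__main__':
            mt_handler()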