
How can I read xlsb files fast with Dask or another Python library?


I want to read large xlsb files with Python, but I can't find any solution. I tried Dask, but it has no function to read xlsb (or Excel) files. I also used `dask.delayed`, but it wasn't fast; it performed the same as plain pyxlsb.

import dask
import pyxlsb as xls

def read_xlsb(file: list):   ## file is the list of paths of the xlsb files to read
    for num in range(len(file)):
        with xls.open_workbook(file[num]) as wb:   ## open each workbook by its path
            for name in wb.sheets:                 ## wb.sheets is the list of sheet names
                with wb.get_sheet(name) as sheet:
                    for row in sheet.rows():
                        data = [item.v for item in row]

ddf = dask.delayed(read_xlsb)
result = ddf.compute()

I tried Dask, but it didn't work well. How can I read xlsb files fast?


Solution

  • I used `dask.delayed`, but it wasn't fast.

    Your function is pure Python and CPU-bound, with lots of internal looping. If multiple threads run this function in one process, you will see no speedup, because only one thread can hold the GIL at a time and make progress. In fact, performance may even decrease due to the overhead of managing the threads.

    If you use processes instead, tasks can run truly in parallel. The best way to achieve that is to use the "distributed" scheduler, even on a single machine.
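    A minimal sketch of that idea, assuming Dask is installed and using a trivial stand-in function (`parse_file` is hypothetical, not from the question) in place of the real xlsb parsing:

    ```python
    import dask

    def parse_file(path):
        # Stand-in for CPU-bound parsing work (e.g. pyxlsb row iteration);
        # each call becomes one task that can run in its own process.
        return path.upper()

    def read_all(paths, scheduler="processes"):
        # One delayed call per file, so the scheduler has one task per file
        # to spread across processes (the GIL no longer serializes them).
        tasks = [dask.delayed(parse_file)(p) for p in paths]
        return dask.compute(*tasks, scheduler=scheduler)
    ```

    The same task graph runs unchanged on a `dask.distributed.Client`, which adds a dashboard and better diagnostics; `scheduler="processes"` is just the lightest-weight way to get true parallelism for CPU-bound pure-Python code.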

    Note that your snippet ends with .compute(). It doesn't appear that your function actually returns or otherwise does anything with the data. If your end goal is a single pandas dataframe and everything fits in memory, then it is probably best to use pandas directly for IO and reach for dask only when you need it. I also note that you never call your delayed function, so you are probably not showing the actual code you ran which was slow.
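    For the pandas route: `pd.read_excel` can read `.xlsb` files through its pyxlsb engine (assuming pandas ≥ 1.0 and pyxlsb are installed). A sketch, where `load_xlsb` is a hypothetical helper name:

    ```python
    import pandas as pd

    def load_xlsb(paths):
        # Read every sheet of every file and stack them into one DataFrame.
        # sheet_name=None returns a dict of {sheet_name: DataFrame}.
        frames = []
        for path in paths:
            sheets = pd.read_excel(path, sheet_name=None, engine="pyxlsb")
            frames.extend(sheets.values())
        return pd.concat(frames, ignore_index=True)
    ```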

    -edit-

    Actually, your function iterates over all files, so there would only ever be one task and no opportunity for parallelism at all.
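    Restructured so that each file becomes its own task (a sketch based on the question's code; the pyxlsb calls assume that library's API and that it is installed):

    ```python
    import dask

    def read_one_xlsb(path):
        # One file per call, so wrapping this with dask.delayed yields
        # one task per file instead of a single task for everything.
        import pyxlsb as xls   # lazy import; assumes pyxlsb is available
        rows = []
        with xls.open_workbook(path) as wb:
            for name in wb.sheets:              # wb.sheets lists the sheet names
                with wb.get_sheet(name) as sheet:
                    for row in sheet.rows():
                        rows.append([item.v for item in row])
        return rows

    def read_files(paths):
        # One delayed task per file; compute them together so the
        # scheduler can run them in parallel.
        tasks = [dask.delayed(read_one_xlsb)(p) for p in paths]
        return dask.compute(*tasks)
    ```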