Search code examples
python-3.xparallel-processingmultiprocessingdaskdask-delayed

How can i execute a certain function on each sheet of a xlsx file having more that 100 sheets in a parallel way?


I have an xlsx file File.xlsx, it has more than 100 sheets.

I need to perform a certain function f() on each sheet's data and finally return a list of each sheet's operated appended together.

I tried using pandas, reading each sheet's data one by one and then applying the function on it and appending in a list. Which is taking a lot of time. Need to the reduce the operation time.

How can I make the sheet wise execution in parallel? Can i use DASK aor anything else?

Note: need to jsonify each sheet's data so using df.to_json()


Solution

  • How can I make the sheet wise execution in parallel? Can i use DASK aor anything else?

    This is a trivial usage of Dask:

    import dask
    import pandas as pd
    
    @dask.delayed
    def get_sheet(filename, sheet_index=0):
        return pd.read_excel(filename, sheet_name=sheet_index)
    
    @dask.delayed
    def process(df: pd.DataFrame) -> pd.DataFrame:
        """
        Inputs
        ------
        df : pd.DataFrame
            A Pandas DataFrame. For this example, this DataFrame represents on sheet.
    
        Returns
        -------
        out : pd.DataFrame
            A new dataframe that makes some modifications on the original sheet.
        """
        out = df.copy()
        out["foo"] = "bar"
        return out
    
    if __name__ == "__main__":
        # Get the sheets of the Excel file (test.xlsx has two sheets)
        future_dfs = [get_sheet("test.xlsx", sheet_index=i) for i in [0, 1]]
    
        # Process the sheets
        processed_dfs = [process(df) for df in future_dfs]
    
        # Now that we've detailed the computation, start the computation.
        dfs = dask.compute(processed_dfs)
    

    The function process should not modify the input DataFrame because functions should be pure and not modify the inputs. See the docstring of dask.delayed for more detail: https://docs.dask.org/en/latest/delayed-api.html#dask.delayed.delayed