Tags: pandas, dask, dask-distributed, dask-dataframe

Dask: How to create 10000 columns in a dask dataframe with improved performance?


I have a dask dataframe and I would like to add 10000 columns to it. Below is what I tried:

series_dict = {}
for i in range(10000):
    # i=i binds the current value; a bare lambda would see only the final i
    series_dict[f'ab_{i}'] = lambda x, i=i: i * x['a'] * x['b']
df = df.assign(**series_dict)

However, it hangs on the assign call itself, before any computation starts. How can I improve this?

Note: This is a simplified lambda function; in the real case, I will have more complicated functions.


Solution

  • In your example, each new series is built from operations on existing series, so a fragment of the task graph has to be generated for every one of the 10000 columns. You are far better off using a single task generator and operating on the underlying pandas dataframes:

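    def create_columns(df):
        # df here is a plain pandas dataframe (one partition)
        series_dict = {}
        for i in range(10000):
            # i=i binds the current value; a bare lambda would see only the final i
            series_dict[f'ab_{i}'] = lambda x, i=i: i * x['a'] * x['b']
        return df.assign(**series_dict)

    # one task per partition instead of graph fragments per column
    new_df = df.map_partitions(create_columns)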