Tags: python, pandas, dataframe, dask, dask-dataframe

Dask - length mismatch when querying


I am trying to import a lot of CSVs into a single dataframe and would like to filter the data after a specific date.

It's throwing the error below and I'm not sure what's wrong.

Is it because there is a mismatch in columns? If so, is there a way to read all the CSVs and perform a union such that the resulting dataframe has all the column names and doesn't raise the error below?

import dask.dataframe as dd
import pandas as pd

df = dd.read_csv('XXXXXXX*.csv', assume_missing=True)
df['time'] = df['time'].map(lambda x: pd.to_datetime(x, errors='coerce'))
filter_t = df[df['time'] >= '2020-11-21 21:22:19']
filter_t.head(npartitions=-1)

[screenshot of the traceback, ending in a length-mismatch ValueError]


Solution

  • It's not clear from the question, but if there's a mismatch in columns, then using dd.read_csv directly is not appropriate. One option is to write a custom delayed wrapper that enforces a specific column structure; roughly, it would look like this:

    # this is the list of columns that the final dataframe should contain
    list_all_columns = ['a', 'b', 'c']

    from glob import glob

    import numpy as np
    import pandas as pd

    import dask.dataframe as dd
    from dask import delayed

    @delayed
    def load_csv(f):
        # read one csv, then add any missing columns as NaN
        df = pd.read_csv(f)
        for c in list_all_columns:
            if c not in df.columns:
                df[c] = np.nan
        return df

    # union all files into one dask dataframe with a consistent column set
    ddf = dd.from_delayed([load_csv(f) for f in glob('x*csv')])

    # the rest of your workflow continues
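
    With the columns unified, the rest of the workflow from the question should run against ddf. A minimal sketch, reusing the 'time' column and the cutoff date from the original snippet (using dask.dataframe's to_datetime in place of the row-wise map):

    # continuing from the ddf built above
    ddf['time'] = dd.to_datetime(ddf['time'], errors='coerce')
    filter_t = ddf[ddf['time'] >= '2020-11-21 21:22:19']
    filter_t.head(npartitions=-1)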