I am trying to import a lot of CSV files into a single dataframe and would like to filter the data after a specific date.
It's throwing the error below and I am not sure what's wrong.
Is it because there is a mismatch in columns? If so, is there a way to read all the CSVs and perform a union such that the resulting dataframe has all the column names and does not raise the error below?
import pandas as pd
import dask.dataframe as dd

# read all matching CSVs into a single dask dataframe
df = dd.read_csv('XXXXXXX*.csv', assume_missing=True)
# parse the time column, coercing unparseable values to NaT
df['time'] = df['time'].map(lambda x: pd.to_datetime(x, errors='coerce'))
# keep only rows on or after the cutoff timestamp
filter_t = df[df['time'] >= '2020-11-21 21:22:19']
filter_t.head(npartitions=-1)
It's not clear from the question, but if there is a mismatch in columns, then dd.read_csv is not appropriate. One option is to write a custom delayed wrapper that enforces a specific column structure. It would roughly look like this:
from glob import glob
import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask import delayed

# this is the list of columns that the final dataframe should contain
list_all_columns = ['a', 'b', 'c']

@delayed
def load_csv(f):
    df = pd.read_csv(f)
    # add any missing columns so every file yields the same schema
    for c in list_all_columns:
        if c not in df.columns:
            df[c] = np.nan
    # enforce a consistent column order across partitions
    return df[list_all_columns]

ddf = dd.from_delayed([load_csv(f) for f in glob('x*csv')])
# the rest of your workflow continues
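For completeness, here is a minimal sketch of how the date filter from the question could then be applied to ddf, assuming a 'time' column is included in list_all_columns (files that lack it will end up with NaT after coercion and be filtered out):

ddf['time'] = ddf['time'].map(lambda x: pd.to_datetime(x, errors='coerce'))
filter_t = ddf[ddf['time'] >= '2020-11-21 21:22:19']
filter_t.head(npartitions=-1)

Depending on your dask version, dd.to_datetime(ddf['time'], errors='coerce') may also work as a vectorized alternative to the elementwise map.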