I have a dataframe where the index is a datetimeindex
, and every row is every day over the course of a couple years. I need to resample the dataframe by month where the two float columns are summed, but the string columns are all the unique values during that month. I can do the resampling to a single column, but I don't know how to do it to everything, or how to combine them back together if I do it one at a time.
For the floats I was trying:
# go through the column list
for col in col_list:
# process all run time columns for month
if "float" in str(col):
# resample for one month and sum
df[col] = df[col].resample('M').sum()
# rename the column
df.rename(columns={col: col + " MONTHLY"}, inplace=True)
and for the strings:
elif "string" in str(col):
# get all the unique jobs run during the month
df[col] = df[col].groupby(pd.Grouper(freq='M')).unique()
df.rename(columns={col: col + " MONTHLY"}, inplace=True)
these were resulting in the monthly data being inserted into tho the dataframe with every day still existing though, and was hard to find and not what I need.
Some sample data:
float_1 float_2 string_1 string_2
12/30/2019 1 2 a a
12/31/2019 1 3 a b
1/1/2020 2 4 a c
1/2/2020 3 5 b d
The expected output would be:
12/2019 2 5 a a, b
1/2020 5 9 a, b c, d
Not sure if it matters but the real data does have NaN in random days throughout the data.
Try aggregate numeric columns and non numeric columns separately and then join them back:
df.index = pd.to_datetime(df.index)
numerics = df.select_dtypes('number').resample('M').sum()
strings = df.select_dtypes('object').resample('M').agg(lambda x: ','.join(set(x)))
numerics.join(strings)
# float_1 float_2 string_1 string_2
#2019-12-31 2 5 a a,b
#2020-01-31 5 9 a,b d,c