I have two columns, min_date and max_date, in a dataframe where each row corresponds to a unique product and 3 levels of product hierarchy (product group). I have taken the difference between them to find the number of days in between (date_diff). Now I want to see how many products fall into different buckets: say, how many products from group 04 have a date_diff of more than 180 days, how many products from group 04 have a date_diff between 150 and 180 days, and so on. I will have 7 buckets of date_diff, from a 0-30 day difference up to a more-than-180-day difference.
I am trying the following code:

check_df = pd.DataFrame()
for i in range(0, 170331):
    if (max_days_by_order.date_diff[i] > 160):
        check_df[i] = max_days_by_order.iloc[i]
check_df
I am getting this error:
'>' not supported between instances of 'Timedelta' and 'int'
My dataframe looks like this:
EDIT: Regarding your additional question:
Sample dataframe (simplified; please always add small samples to your question that can be copied, not screenshots):
import pandas as pd

df = pd.DataFrame({
    'prod_loc': range(10),
    'code_1': ['01'] * 5 + ['02'] * 5,
    'code_2': ['001'] * 3 + ['002'] * 3 + ['003'] * 4,
    'min_date': pd.to_datetime(['2021-07-22'] * 10),
    'max_date': pd.date_range('2021-07-22', periods=10, freq='25d')
})
df['date_diff'] = df.max_date - df.min_date
prod_loc code_1 code_2 min_date max_date date_diff
0 0 01 001 2021-07-22 2021-07-22 0 days
1 1 01 001 2021-07-22 2021-08-16 25 days
2 2 01 001 2021-07-22 2021-09-10 50 days
3 3 01 002 2021-07-22 2021-10-05 75 days
4 4 01 002 2021-07-22 2021-10-30 100 days
5 5 02 002 2021-07-22 2021-11-24 125 days
6 6 02 003 2021-07-22 2021-12-19 150 days
7 7 02 003 2021-07-22 2022-01-13 175 days
8 8 02 003 2021-07-22 2022-02-07 200 days
9 9 02 003 2021-07-22 2022-03-04 225 days
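Note that date_diff is a timedelta64[ns] column, which is exactly why the comparison in your loop failed; .dt.days converts it to plain integers, and those are what the comparisons and pd.cut below work on:

df.date_diff.dtype     # timedelta64[ns]
df.date_diff.dt.days   # 0, 25, 50, ... as int64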
First step: setting up buckets (you would choose your own) and pd.cut-ing the date_diff column with them:
buckets = list(range(0, 181, 50)) + [df.date_diff.max().days + 1]
cut = pd.cut(df.date_diff.dt.days, buckets, right=False)
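As an aside: if you want exactly the seven buckets from your question (0-30 days up to more than 180), the edges could be built like this (a sketch; the label strings are just illustrative names I made up):

import numpy as np

buckets7 = list(range(0, 181, 30)) + [np.inf]  # 0, 30, ..., 180, inf
labels7 = ['0-30', '30-60', '60-90', '90-120', '120-150', '150-180', '>180']
cut7 = pd.cut(df.date_diff.dt.days, buckets7, right=False, labels=labels7)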
And then, as a second step, do
result = df.groupby(['code_1', cut]).prod_loc.count().unstack(1)
which yields
date_diff [0, 50) [50, 100) [100, 150) [150, 226)
code_1
01 2 2 1 0
02 0 0 1 4
or
result = df.groupby(['code_1', 'code_2', cut]).prod_loc.count().unstack(2)
which yields
date_diff [0, 50) [50, 100) [100, 150) [150, 226)
code_1 code_2
01 001 2 1 0 0
002 0 1 1 0
003 0 0 0 0
02 001 0 0 0 0
002 0 0 1 0
003 0 0 0 4
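One caveat: depending on your pandas version, grouping by a categorical like cut may raise a FutureWarning about the observed parameter; passing observed=False explicitly keeps the all-zero combinations shown above:

result = (df.groupby(['code_1', 'code_2', cut], observed=False)
            .prod_loc.count().unstack(2))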
You don't need to unstack if you prefer a longer view.
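For example, leaving out the unstack gives the same counts as one long series:

df.groupby(['code_1', cut]).prod_loc.count()

which yields

code_1  date_diff
01      [0, 50)       2
        [50, 100)     2
        [100, 150)    1
        [150, 226)    0
02      [0, 50)       0
        [50, 100)     0
        [100, 150)    1
        [150, 226)    4
Name: prod_loc, dtype: int64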
You can also try
df['buckets'] = cut

# per product group (code_1) only
result = df.pivot_table(index=['code_1'], columns='buckets',
                        values='prod_loc', aggfunc='count')

# per code_1 and code_2
result = df.pivot_table(index=['code_1', 'code_2'], columns='buckets',
                        values='prod_loc', aggfunc='count')
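One detail: pivot_table fills empty cells with NaN by default, so if you want explicit zero counts like in the groupby variant, add fill_value=0:

result = df.pivot_table(index=['code_1', 'code_2'], columns='buckets',
                        values='prod_loc', aggfunc='count', fill_value=0)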
Is this what you are looking for?
Btw.: Don't iterate over dataframes unless you absolutely have to. Use the native pandas methods instead. For example, for
max_days_by_order = pd.DataFrame({
    'min_date': pd.to_datetime(['2021-07-21', '2021-07-22']),
    'max_date': pd.to_datetime(['2021-10-21', '2022-07-22'])
})
max_days_by_order['date_diff'] = (max_days_by_order.max_date
                                  - max_days_by_order.min_date)
min_date max_date date_diff
0 2021-07-21 2021-10-21 92 days
1 2021-07-22 2022-07-22 365 days
this
check_df = max_days_by_order.date_diff.where(
    max_days_by_order.date_diff.dt.days > 180
)
produces
0 NaT
1 365 days
Name: date_diff, dtype: timedelta64[ns]
That seems to be what you are trying to achieve? (I don't have the full picture, so I might have missed something.)
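Regarding the original error itself: if you'd rather keep the Timedelta dtype, you can also compare against a pd.Timedelta, and plain boolean indexing then drops the non-matching rows instead of leaving NaT (a small sketch in the same spirit):

# comparing Timedelta to Timedelta avoids the "'>' not supported" error
mask = max_days_by_order.date_diff > pd.Timedelta(days=180)
check_df = max_days_by_order[mask]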