Search code examples
pythonpandasdataframedatetimetimedelta

Using timedelta in loop


I have two columns a min_date and a max_date in dataframe where each row corresponds to a unique products and 3 levels of product hierarchy(product group), I have taken the difference between them to find the number of days in between(date_diff). Now , I want to see how many products are falling in different buckets. Say , how many products from group 04 have date_diff of more than 180 days,how many products from group 04 have date_diff of more than 150 days and less than 180 days ..and so on I will have 7 buckets of date_diff from 0-30 days difference to more than 180 days difference.

I am trying the following code :

    check_df=pd.DataFrame()
    for i in range(0,170331) :
         if (max_days_by_order.date_diff[i] > 160) :
                check_df[i] =  max_days_by_order.iloc[i]
        
    check_df  

I am getting this error :

'>' not supported between instances of 'Timedelta' and 'int'

my dataframe looks like this
enter image description here


Solution

  • EDIT: Regarding your additional question:

    Sample dataframe (simplified - please always add little samples to your question which can be copied, not screenshots):

    df = pd.DataFrame({
        'prod_loc': range(10),
        'code_1': ['01'] * 5 + ['02'] * 5,
        'code_2': ['001'] * 3 + ['002'] * 3 + ['003'] * 4,
        'min_date': pd.to_datetime(['2021-07-22'] * 10),
        'max_date': pd.date_range('2021-07-22', periods=10, freq='25d')
    })
    df['date_diff'] = df.max_date - df.min_date
    
       prod_loc code_1 code_2   min_date   max_date date_diff
    0         0     01    001 2021-07-22 2021-07-22    0 days
    1         1     01    001 2021-07-22 2021-08-16   25 days
    2         2     01    001 2021-07-22 2021-09-10   50 days
    3         3     01    002 2021-07-22 2021-10-05   75 days
    4         4     01    002 2021-07-22 2021-10-30  100 days
    5         5     02    002 2021-07-22 2021-11-24  125 days
    6         6     02    003 2021-07-22 2021-12-19  150 days
    7         7     02    003 2021-07-22 2022-01-13  175 days
    8         8     02    003 2021-07-22 2022-02-07  200 days
    9         9     02    003 2021-07-22 2022-03-04  225 days
    

    First step: Setting up buckets (you would choose others) and pd.cut-ing the diff_days-column with them:

    buckets = list(range(0, 181, 50)) + [df.date_diff.max().days + 1]
    cut = pd.cut(df.date_diff.dt.days, buckets, right=False)
    

    And then, second step, do

    result = df.groupby(['code_1', cut]).prod_loc.count().unstack(1)
    

    which yields

    date_diff  [0, 50)  [50, 100)  [100, 150)  [150, 226)
    code_1                                               
    01               2          2           1           0
    02               0          0           1           4
    

    or

    result = df.groupby(['code_1', 'code_2', cut]).prod_loc.count().unstack(2)
    

    which yields

    date_diff      [0, 50)  [50, 100)  [100, 150)  [150, 226)
    code_1 code_2                                            
    01     001           2          1           0           0
           002           0          1           1           0
           003           0          0           0           0
    02     001           0          0           0           0
           002           0          0           1           0
           003           0          0           0           4
    

    You don't need to unstack if you prefer a longer view.

    You can also try

    df['buckets'] = cut
    result = df.pivot_table(index=['code_1'], columns='buckets',
                            values='prod_loc', aggfunc='count')
    result = df.pivot_table(index=['code_1', 'code_2'], columns='buckets',
                            values='prod_loc', aggfunc='count')
    

    Is this what you are looking for?


    Btw.: Don't iterate over dataframes, except you absolutely have to. Use the native Pandas methods. For example, for

    max_days_by_order = pd.DataFrame({
        'min_date': pd.to_datetime(['2021-07-21', '2021-07-22']),
        'max_date': pd.to_datetime(['2021-10-21', '2022-07-22'])
    })
    max_days_by_order['date_diff'] = (max_days_by_order.max_date
                                      - max_days_by_order.min_date)
    
        min_date   max_date date_diff
    0 2021-07-21 2021-10-21   92 days
    1 2021-07-22 2022-07-22  365 days
    

    this

    check_df = max_days_by_order.date_diff.where(
                    max_days_by_order.date_diff.dt.days > 180
               )
    

    produces

    0        NaT
    1   365 days
    Name: date_diff, dtype: timedelta64[ns]
    

    Which seems to be what you are trying to achieve? (I don't have the full picture, so I might have missed something.)