Search code examples
pythonpandasdataframecountrow

calculate the queue for orders based on creation and delivery date, by product group


I have a Pandas dataframe containing records for a lot of orders, one recorde for each order. Each record has order_id, category_id, created_at and picked_at. I need to calculate queue length for each order at the time of it's creation. Which means for each record current_order I need to count the number of rows with following conditions:

  • must have the same category_id as the current_order
  • must be created before created_at of the current_order
  • must be picked after created_at of the current_order

The dataframe is quite larg hence doing the calculation using a loop is too time consuming. How can I do this faster?

Any help would be greatly appreciated.

Edited

A sample of dataframe:

          id  category_id          created_at           picked_at
0  123228779        69558 2021-05-22 00:08:46 2021-05-22 00:22:45
1  123228972        69558 2021-05-22 00:12:39 2021-05-22 00:17:00
2  123229120         6725 2021-05-22 00:15:47 2021-05-22 00:42:50
3  123229210        41358 2021-05-22 00:17:44 2021-05-22 00:35:34
4  123229152         6725 2021-05-22 00:16:29 2021-05-22 01:05:43

Solution

  • Let's first start by reshaping the dataframe to have created_at and picked_at in the same column. Then we calculate the queue value.

    df2 = (df.melt(id_vars=['id', 'category_id'],
                   var_name='type',
                   value_name='time')
             .sort_values(by=['category_id', 'time']) # not required to sort by "category_id",
                                                      # but done here for clarity
          )
    
    df2['queue'] = (df2['type'].map({'created_at': 1, 'picked_at': -1})
                               .cumsum()
                   )
    
    >>> df2
              id  category_id        type                time  queue
    2  123229120         6725  created_at 2021-05-22 00:15:47      1
    4  123229152         6725  created_at 2021-05-22 00:16:29      2
    7  123229120         6725   picked_at 2021-05-22 00:42:50      1
    9  123229152         6725   picked_at 2021-05-22 01:05:43      0
    3  123229210        41358  created_at 2021-05-22 00:17:44      1
    8  123229210        41358   picked_at 2021-05-22 00:35:34      0
    0  123228779        69558  created_at 2021-05-22 00:08:46      1
    1  123228972        69558  created_at 2021-05-22 00:12:39      2
    6  123228972        69558   picked_at 2021-05-22 00:17:00      1
    5  123228779        69558   picked_at 2021-05-22 00:22:45      0
    

    Finally, we reshape the queue to the original dataframe:

    df['queue'] = (df2.pivot(columns=['type'],
                             values=['queue'])
                      .loc[:, ('queue', 'created_at')]
                      .dropna()
                      .astype(int)
                  )
    

    output:

              id  category_id          created_at           picked_at  queue
    0  123228779        69558 2021-05-22 00:08:46 2021-05-22 00:22:45      1
    1  123228972        69558 2021-05-22 00:12:39 2021-05-22 00:17:00      2
    2  123229120         6725 2021-05-22 00:15:47 2021-05-22 00:42:50      1
    3  123229210        41358 2021-05-22 00:17:44 2021-05-22 00:35:34      1
    4  123229152         6725 2021-05-22 00:16:29 2021-05-22 01:05:43      2
    

    NB. this gives us the queue, per category_id, after creation.