Search code examples
pythonpandasdataframemergecross-join

How can I align a merged dataset?


I have two datasets. One with supplies, other with sales. They have different dates and time.

Suppplies

         Year      Month   Day  Hour       Item
    0    2023       05     17    10         8 
    1    2023       06     01    12         8 
    2    2023       06     10    16         3
    3    2023       06     17    10         8 
    4    2023       07     01    10         8 
    5    2023       08     17    10         8 

Sales

         Year      Month   Day  Hour       Sale
    0    2023       05     17    16         3 
    1    2023       05     18    12         3 
    2    2023       05     24    16         3 
    3    2023       05     27    10         1 
    4    2023       06     02    10         2 
    5    2023       06     03    10         3 

I need both information so i merged them

         Year      Month   Day  Hour       Item    Year      Month   Day  Hour       Item  
    0    2023       05     17    10         8      2023       05     17    16         3
    1    2023       06     01    12         8      2023       05     18    12         3
    2    2023       06     10    16         3      2023       05     24    16         3
    3    2023       06     17    10         8      2023       05     27    10         1
    4    2023       07     01    10         8      2023       06     02    10         2
    5    2023       08     17    10         8      2023       06     03    10    3 

I want that if there was no delivery on that day, then the date on the right is duplicated and the quantity is 0. Until a new delivery

I desire to get

         Year      Month   Day  Hour       Item    Year      Month   Day  Hour       Item  
    0    2023       05     17    10         8      2023       05     17    16         3
    1    2023      NaN    NaN    NaN        0      2023       05     18    12         3
    2    2023       NaN    NaN    NaN       0      2023       05     20    16         3
    3    2023      NaN    NaN    NaN        0      2023       05     27    10         1
    4    2023       06     01    12         8      2023       06     02    10         2
    5    2023      NaN    NaN    NaN        0      2023       06     03    10         3 

I would like to get this result, if the date value on the left is less than the date value on the right, and if the value in the column is 0 then it will be replaced by nan.


Solution

  • There are two ways how the two datasets can be combined in the desired way:

    1. Conditional, SQL-like join on the supplies and sales dates. In particular, a particular sales record will be joined with the current supply record when the sales date is equal to/after the date of the current supply record, where the date of the sales record must not be greater than the date of the subsequent supply record.
    2. Cartesian product of both tables with subsequent filter. We cross-join all rows of both tables and subsequently filter the data as mentioned in 1.

    To the best of my knowledge, conditional joins (as they exist in SQL) cannot be done in pandas see this SO post here

    Below you can find the code showing how both approaches can be done, where the first approach also requires the sqlite3 module. I personally would recommend you the first approach as performing a cross join can be computationally very expensive.

    Code for first approach:

    # loading data using solution from https://stackoverflow.com/a/53692642/8718701
    from io import StringIO
    
    import numpy as np
    import pandas as pd
    import sqlite3
    
    d = '''
             Year      Month   Day  Hour       Item
        0    2023       05     17    10         8
        1    2023       06     01    12         8
        2    2023       06     10    16         3
        3    2023       06     17    10         8
        4    2023       07     01    10         8
        5    2023       08     17    10         8
    '''
    
    supplies_df = pd.read_csv(StringIO(d), sep='\s+')
    
    d = '''
             Year      Month   Day  Hour       Sale
        0    2023       05     17    16         3
        1    2023       05     18    12         3
        2    2023       05     24    16         3
        3    2023       05     27    10         1
        4    2023       06     02    10         2
        5    2023       06     03    10         3
    '''
    
    sales_df = pd.read_csv(StringIO(d), sep='\s+')
    
    # first approach based on https://stackoverflow.com/a/42796283/8718701
    
    supplies_df['supply_datetime'] = pd.to_datetime(
        supplies_df[['Year', 'Month', 'Day', 'Hour']]
    )
    supplies_df['next_supply_datetime'] = supplies_df['supply_datetime'].shift(-1)
    
    sales_df['sale_datetime'] = pd.to_datetime(
        sales_df[['Year', 'Month', 'Day', 'Hour']]
    )
    
    #Make the db in memory
    conn = sqlite3.connect(':memory:')
    #write the tables
    supplies_df.to_sql('supplies', conn, index=False)
    sales_df.to_sql('sales', conn, index=False)
    
    # query joining tables on conditional join:
    # join a sales record only if its datetime is equal to/greater than the datetime
    # of the current supply record as well as smaller than the datetime of the
    # subsequent supply record
    qry = '''
        SELECT 
            A.supply_datetime,
            A.Year AS supply_year,
            A.Month AS supply_month,
            A.Day AS supply_day,
            A.Hour AS supply_hour,
            A.Item,
            B.sale_datetime,
            B.Year AS sale_year,
            B.Month AS sale_month,
            B.Day AS sale_day,
            B.Hour AS sale_hour,
            B.Sale
        FROM supplies AS A
        JOIN sales AS B
        ON A.supply_datetime <= B.sale_datetime
            AND A.next_supply_datetime > B.sale_datetime
        '''
    df = pd.read_sql_query(qry, conn)
    
    # remove duplicated info
    record_is_dupe = df.duplicated('supply_datetime')
    cols_to_remove = [
        'supply_datetime', 'supply_year',
        'supply_month', 'supply_day', 'supply_hour'
    ]
    df.loc[record_is_dupe, cols_to_remove] = np.NaN
    df.loc[record_is_dupe, ['Item']] = 0
    
    # remove datetime columns
    df.drop(columns=['supply_datetime', 'sale_datetime'], inplace=True)
    
    # matches expected output
    print(df.to_markdown(index=False))
    
    # |   supply_year |   supply_month |   supply_day |   supply_hour |   Item |   sale_year |   sale_month |   sale_day |   sale_hour |   Sale |
    # |--------------:|---------------:|-------------:|--------------:|-------:|------------:|-------------:|-----------:|------------:|-------:|
    # |          2023 |              5 |           17 |            10 |      8 |        2023 |            5 |         17 |          16 |      3 |
    # |           nan |            nan |          nan |           nan |      0 |        2023 |            5 |         18 |          12 |      3 |
    # |           nan |            nan |          nan |           nan |      0 |        2023 |            5 |         24 |          16 |      3 |
    # |           nan |            nan |          nan |           nan |      0 |        2023 |            5 |         27 |          10 |      1 |
    # |          2023 |              6 |            1 |            12 |      8 |        2023 |            6 |          2 |          10 |      2 |
    # |           nan |            nan |          nan |           nan |      0 |        2023 |            6 |          3 |          10 |      3 |
    

    Code for second approach:

    # loading data using solution from https://stackoverflow.com/a/53692642/8718701
    from io import StringIO
    
    import numpy as np
    import pandas as pd
    
    d = '''
             Year      Month   Day  Hour       Item
        0    2023       05     17    10         8
        1    2023       06     01    12         8
        2    2023       06     10    16         3
        3    2023       06     17    10         8
        4    2023       07     01    10         8
        5    2023       08     17    10         8
    '''
    
    supplies_df = pd.read_csv(StringIO(d), sep='\s+')
    
    d = '''
             Year      Month   Day  Hour       Sale
        0    2023       05     17    16         3
        1    2023       05     18    12         3
        2    2023       05     24    16         3
        3    2023       05     27    10         1
        4    2023       06     02    10         2
        5    2023       06     03    10         3
    '''
    
    sales_df = pd.read_csv(StringIO(d), sep='\s+')
    
    # second approach based on https://stackoverflow.com/a/53699198/8718701
    
    supplies_df['datetime'] = pd.to_datetime(
        supplies_df[['Year', 'Month', 'Day', 'Hour']]
    )
    supplies_df['datetime_shift'] = supplies_df['datetime'].shift(-1)
    supplies_df.columns = ['supply_' + col.lower() for col in supplies_df.columns]
    
    # renaming columns as duplicate names cause problems
    sales_df['datetime'] = pd.to_datetime(
        sales_df[['Year', 'Month', 'Day', 'Hour']]
    )
    sales_df.columns = ['sale_' + col.lower() for col in sales_df.columns]
    
    # Cartesian product
    df = pd.merge(left=supplies_df, right=sales_df, how='cross')
    
    # Filtering rows based on condition
    winnowing_condition = (
        ((df['supply_datetime'] <= df['sale_datetime'])
        & (df['supply_datetime_shift'] > df['sale_datetime']))
    )
    df = df.loc[winnowing_condition, :]
    
    # remove duplicated info
    record_is_dupe = df.duplicated('supply_datetime')
    cols_to_remove = [
        'supply_datetime', 'supply_year',
        'supply_month', 'supply_day', 'supply_hour'
    ]
    df.loc[record_is_dupe, cols_to_remove] = np.NaN
    df.loc[record_is_dupe, ['supply_item']] = 0
    
    # remove datetime columns
    cols_to_drop = ['supply_datetime', 'sale_datetime', 'supply_datetime_shift']
    df.drop(columns=cols_to_drop, inplace=True)
    df.rename(columns={'supply_item': 'Item', 'sale_sale': 'Sale'}, inplace=True)
    
    # matches expected output
    print(df.to_markdown(index=False))
    
    
    # |   supply_year |   supply_month |   supply_day |   supply_hour |   Item |   sale_year |   sale_month |   sale_day |   sale_hour |   Sale |
    # |--------------:|---------------:|-------------:|--------------:|-------:|------------:|-------------:|-----------:|------------:|-------:|
    # |          2023 |              5 |           17 |            10 |      8 |        2023 |            5 |         17 |          16 |      3 |
    # |           nan |            nan |          nan |           nan |      0 |        2023 |            5 |         18 |          12 |      3 |
    # |           nan |            nan |          nan |           nan |      0 |        2023 |            5 |         24 |          16 |      3 |
    # |           nan |            nan |          nan |           nan |      0 |        2023 |            5 |         27 |          10 |      1 |
    # |          2023 |              6 |            1 |            12 |      8 |        2023 |            6 |          2 |          10 |      2 |
    # |           nan |            nan |          nan |           nan |      0 |        2023 |            6 |          3 |          10 |      3 |