I have two datasets. One with supplies, other with sales. They have different dates and time.
Suppplies
Year Month Day Hour Item
0 2023 05 17 10 8
1 2023 06 01 12 8
2 2023 06 10 16 3
3 2023 06 17 10 8
4 2023 07 01 10 8
5 2023 08 17 10 8
Sales
Year Month Day Hour Sale
0 2023 05 17 16 3
1 2023 05 18 12 3
2 2023 05 24 16 3
3 2023 05 27 10 1
4 2023 06 02 10 2
5 2023 06 03 10 3
I need both information so i merged them
Year Month Day Hour Item Year Month Day Hour Item
0 2023 05 17 10 8 2023 05 17 16 3
1 2023 06 01 12 8 2023 05 18 12 3
2 2023 06 10 16 3 2023 05 24 16 3
3 2023 06 17 10 8 2023 05 27 10 1
4 2023 07 01 10 8 2023 06 02 10 2
5 2023 08 17 10 8 2023 06 03 10 3
I want that if there was no delivery on that day, then the date on the right is duplicated and the quantity is 0. Until a new delivery
I desire to get
Year Month Day Hour Item Year Month Day Hour Item
0 2023 05 17 10 8 2023 05 17 16 3
1 2023 NaN NaN NaN 0 2023 05 18 12 3
2 2023 NaN NaN NaN 0 2023 05 20 16 3
3 2023 NaN NaN NaN 0 2023 05 27 10 1
4 2023 06 01 12 8 2023 06 02 10 2
5 2023 NaN NaN NaN 0 2023 06 03 10 3
I would like to get this result, if the date value on the left is less than the date value on the right, and if the value in the column is 0 then it will be replaced by nan.
There are two ways how the two datasets can be combined in the desired way:
SQL
-like join on the supplies and sales dates. In particular, a particular sales record will be joined with the current supply record when the sales date is equal to/after the date of the current supply record, where the date of the sales record must not be greater than the date of the subsequent supply record.To the best of my knowledge, conditional joins (as they exist in SQL
) cannot be done in pandas
see this SO post here
Below you can find the code showing how both approaches can be done, where the first approach also requires the sqlite3
module. I personally would recommend you the first approach as performing a cross join can be computationally very expensive.
Code for first approach:
# loading data using solution from https://stackoverflow.com/a/53692642/8718701
from io import StringIO
import numpy as np
import pandas as pd
import sqlite3
d = '''
Year Month Day Hour Item
0 2023 05 17 10 8
1 2023 06 01 12 8
2 2023 06 10 16 3
3 2023 06 17 10 8
4 2023 07 01 10 8
5 2023 08 17 10 8
'''
supplies_df = pd.read_csv(StringIO(d), sep='\s+')
d = '''
Year Month Day Hour Sale
0 2023 05 17 16 3
1 2023 05 18 12 3
2 2023 05 24 16 3
3 2023 05 27 10 1
4 2023 06 02 10 2
5 2023 06 03 10 3
'''
sales_df = pd.read_csv(StringIO(d), sep='\s+')
# first approach based on https://stackoverflow.com/a/42796283/8718701
supplies_df['supply_datetime'] = pd.to_datetime(
supplies_df[['Year', 'Month', 'Day', 'Hour']]
)
supplies_df['next_supply_datetime'] = supplies_df['supply_datetime'].shift(-1)
sales_df['sale_datetime'] = pd.to_datetime(
sales_df[['Year', 'Month', 'Day', 'Hour']]
)
#Make the db in memory
conn = sqlite3.connect(':memory:')
#write the tables
supplies_df.to_sql('supplies', conn, index=False)
sales_df.to_sql('sales', conn, index=False)
# query joining tables on conditional join:
# join a sales record only if its datetime is equal to/greater than the datetime
# of the current supply record as well as smaller than the datetime of the
# subsequent supply record
qry = '''
SELECT
A.supply_datetime,
A.Year AS supply_year,
A.Month AS supply_month,
A.Day AS supply_day,
A.Hour AS supply_hour,
A.Item,
B.sale_datetime,
B.Year AS sale_year,
B.Month AS sale_month,
B.Day AS sale_day,
B.Hour AS sale_hour,
B.Sale
FROM supplies AS A
JOIN sales AS B
ON A.supply_datetime <= B.sale_datetime
AND A.next_supply_datetime > B.sale_datetime
'''
df = pd.read_sql_query(qry, conn)
# remove duplicated info
record_is_dupe = df.duplicated('supply_datetime')
cols_to_remove = [
'supply_datetime', 'supply_year',
'supply_month', 'supply_day', 'supply_hour'
]
df.loc[record_is_dupe, cols_to_remove] = np.NaN
df.loc[record_is_dupe, ['Item']] = 0
# remove datetime columns
df.drop(columns=['supply_datetime', 'sale_datetime'], inplace=True)
# matches expected output
print(df.to_markdown(index=False))
# | supply_year | supply_month | supply_day | supply_hour | Item | sale_year | sale_month | sale_day | sale_hour | Sale |
# |--------------:|---------------:|-------------:|--------------:|-------:|------------:|-------------:|-----------:|------------:|-------:|
# | 2023 | 5 | 17 | 10 | 8 | 2023 | 5 | 17 | 16 | 3 |
# | nan | nan | nan | nan | 0 | 2023 | 5 | 18 | 12 | 3 |
# | nan | nan | nan | nan | 0 | 2023 | 5 | 24 | 16 | 3 |
# | nan | nan | nan | nan | 0 | 2023 | 5 | 27 | 10 | 1 |
# | 2023 | 6 | 1 | 12 | 8 | 2023 | 6 | 2 | 10 | 2 |
# | nan | nan | nan | nan | 0 | 2023 | 6 | 3 | 10 | 3 |
Code for second approach:
# loading data using solution from https://stackoverflow.com/a/53692642/8718701
from io import StringIO
import numpy as np
import pandas as pd
d = '''
Year Month Day Hour Item
0 2023 05 17 10 8
1 2023 06 01 12 8
2 2023 06 10 16 3
3 2023 06 17 10 8
4 2023 07 01 10 8
5 2023 08 17 10 8
'''
supplies_df = pd.read_csv(StringIO(d), sep='\s+')
d = '''
Year Month Day Hour Sale
0 2023 05 17 16 3
1 2023 05 18 12 3
2 2023 05 24 16 3
3 2023 05 27 10 1
4 2023 06 02 10 2
5 2023 06 03 10 3
'''
sales_df = pd.read_csv(StringIO(d), sep='\s+')
# second approach based on https://stackoverflow.com/a/53699198/8718701
supplies_df['datetime'] = pd.to_datetime(
supplies_df[['Year', 'Month', 'Day', 'Hour']]
)
supplies_df['datetime_shift'] = supplies_df['datetime'].shift(-1)
supplies_df.columns = ['supply_' + col.lower() for col in supplies_df.columns]
# renaming columns as duplicate names cause problems
sales_df['datetime'] = pd.to_datetime(
sales_df[['Year', 'Month', 'Day', 'Hour']]
)
sales_df.columns = ['sale_' + col.lower() for col in sales_df.columns]
# Cartesian product
df = pd.merge(left=supplies_df, right=sales_df, how='cross')
# Filtering rows based on condition
winnowing_condition = (
((df['supply_datetime'] <= df['sale_datetime'])
& (df['supply_datetime_shift'] > df['sale_datetime']))
)
df = df.loc[winnowing_condition, :]
# remove duplicated info
record_is_dupe = df.duplicated('supply_datetime')
cols_to_remove = [
'supply_datetime', 'supply_year',
'supply_month', 'supply_day', 'supply_hour'
]
df.loc[record_is_dupe, cols_to_remove] = np.NaN
df.loc[record_is_dupe, ['supply_item']] = 0
# remove datetime columns
cols_to_drop = ['supply_datetime', 'sale_datetime', 'supply_datetime_shift']
df.drop(columns=cols_to_drop, inplace=True)
df.rename(columns={'supply_item': 'Item', 'sale_sale': 'Sale'}, inplace=True)
# matches expected output
print(df.to_markdown(index=False))
# | supply_year | supply_month | supply_day | supply_hour | Item | sale_year | sale_month | sale_day | sale_hour | Sale |
# |--------------:|---------------:|-------------:|--------------:|-------:|------------:|-------------:|-----------:|------------:|-------:|
# | 2023 | 5 | 17 | 10 | 8 | 2023 | 5 | 17 | 16 | 3 |
# | nan | nan | nan | nan | 0 | 2023 | 5 | 18 | 12 | 3 |
# | nan | nan | nan | nan | 0 | 2023 | 5 | 24 | 16 | 3 |
# | nan | nan | nan | nan | 0 | 2023 | 5 | 27 | 10 | 1 |
# | 2023 | 6 | 1 | 12 | 8 | 2023 | 6 | 2 | 10 | 2 |
# | nan | nan | nan | nan | 0 | 2023 | 6 | 3 | 10 | 3 |