I had been searching for the answer to my question for a long time, looking it up in the documentation but have not found anything appropriate yet.
I am trying to merge two datasets: the first one with important statistics of tennis matches for 2000-2015, the second one with odds on tennis matches and some general statistics of them for 2001-2016.
The crucial difficulties:
Below one may see the datasets with some columns. The first dataset contains important stats about matches, the second one includes odds and general information.
matches[['Winner', 'Loser', 'Tournament Start Date']].head()
Winner Loser Tournament Start Date
0 Dupuis A. Ilie A. 2000-05-01
1 Gonzalez F. Mamiit C. 2000-05-01
2 Srichaphan P. Lareau S. 2000-05-01
3 Siemerink J. Gimelstob J. 2000-05-01
4 Stoltenberg J. Moron A. 2000-05-01
all_bets[['Winner', 'Loser', 'Date']].head()
Winner Loser Date of Match
0 Clement A. Gaudenzi A. 2001-01-01
1 Goldstein P. Jones A. 2001-01-01
2 Haas T. Smith L. 2001-01-01
3 Henman T. Rusedski G. 2001-01-01
4 Hewitt L. Arthurs W. 2001-01-01
As for me, the best way to merge the datasets accurately is on 3 conditions*: equalities of columns 'Winner', 'Loser' and... the difference between the date of a match and the tournament start date is not greater that 14 days because:
Usage of 'Winner' and 'Loser" conditions is obvious:
matches_bets = pd.merge(matches, all_bets, on=["Winner", "Loser"], how="inner")
However, I do not understand how to apply the inequality condition of dates to merge the datasets.
I would be grateful for any help in doing this task.
*Other columns are certainly not better to be used while merging the datasets - this is why one cannot see them here.
First, convert column with dates to datetime
matches['Tournament Start Date'] = pd.to_datetime(matches['Tournament Start Date'], format='%Y-%m-%d')
all_bets['Date of Match'] = pd.to_datetime(all_bets['Date of Match'], format='%Y-%m-%d')
Then you should merge as you did before
matches_bets = pd.merge(matches, all_bets, on=["Winner", "Loser"], how="inner")
Then apply your condition by extra column ['date_difference']
matches_bets['date_difference'] = matches_bets['Date of Match'] - matches_bets['Tournament Start Date']
matches_bets['KEEP_IN_PLACE'] = matches_bets['date_difference'].apply(lambda x: 1 if x.days <= 14 else 0)
matches_bets = matches_bets[matches_bets['KEEP_IN_PLACE'] == 1]
And you're done =)