Search code examples
pythonpython-3.xpandasmergedata-analysis

Merging two tennis datasets on datetime condition


I had been searching for the answer to my question for a long time, looking it up in the documentation but have not found anything appropriate yet.

I am trying to merge two datasets: the first one with important statistics of tennis matches for 2000-2015, the second one with odds on tennis matches and some general statistics of them for 2001-2016.

The crucial difficulties:

  1. Names of similar tournaments are often not equal in the datasets so it is impossible to use them in order to merge the datasets.
  2. In the dataset with important statistics, there is only date of tournament start whereas in the other there are dates of every match. Therefore, just equality of dates does not work in general.

Below one may see the datasets with some columns. The first dataset contains important stats about matches, the second one includes odds and general information.

matches[['Winner', 'Loser', 'Tournament Start Date']].head()

         Winner       Loser        Tournament Start Date
     0 Dupuis A.       Ilie A.       2000-05-01
     1 Gonzalez F.     Mamiit C.     2000-05-01
     2 Srichaphan P.   Lareau S.     2000-05-01
     3 Siemerink J.    Gimelstob J.  2000-05-01
     4 Stoltenberg J.  Moron A.      2000-05-01


all_bets[['Winner', 'Loser', 'Date']].head()

          Winner        Loser       Date of Match
     0  Clement A.    Gaudenzi A.    2001-01-01 
     1  Goldstein P.  Jones A.       2001-01-01 
     2  Haas T.       Smith L.       2001-01-01 
     3  Henman T.     Rusedski G.    2001-01-01 
     4  Hewitt L.     Arthurs W.     2001-01-01

As for me, the best way to merge the datasets accurately is on 3 conditions*: equalities of columns 'Winner', 'Loser' and... the difference between the date of a match and the tournament start date is not greater that 14 days because:

  1. Two players can play against each other only once in one tournament
  2. Each tournament occurs only once a year
  3. There are no tournaments with a duration longer than 14 days.

Usage of 'Winner' and 'Loser" conditions is obvious:

matches_bets = pd.merge(matches, all_bets, on=["Winner", "Loser"], how="inner")

However, I do not understand how to apply the inequality condition of dates to merge the datasets.

I would be grateful for any help in doing this task.

*Other columns are certainly not better to be used while merging the datasets - this is why one cannot see them here.


Solution

  • First, convert column with dates to datetime

    matches['Tournament Start Date'] = pd.to_datetime(matches['Tournament Start Date'], format='%Y-%m-%d')
    all_bets['Date of Match'] = pd.to_datetime(all_bets['Date of Match'], format='%Y-%m-%d')
    

    Then you should merge as you did before

    matches_bets = pd.merge(matches, all_bets, on=["Winner", "Loser"], how="inner")
    

    Then apply your condition by extra column ['date_difference']

    matches_bets['date_difference'] = matches_bets['Date of Match'] - matches_bets['Tournament Start Date']
    matches_bets['KEEP_IN_PLACE'] = matches_bets['date_difference'].apply(lambda x: 1 if x.days <= 14 else 0)
    matches_bets = matches_bets[matches_bets['KEEP_IN_PLACE'] == 1]
    

    And you're done =)