Search code examples
phpmysqlsqlsqlperformance

Improve performance of complex SQL query


Since a few month by now I am collecting data of football matches from open sources. To do so I grab the data of a specific website URL, that presents the data for a specific football match, with PHP via XPath. Then I do some data editing so they fit the way I need them. The next and last step is to transfer them into my MySQL database in several tables.

With a fast growing database I slowly run into severe performance issues. Because I do everything locally on my computer and that one isn't a monster of machine, it already takes some time to process a match. To get a feeling how fast that is: In the first days of datamining a match took around 24 seconds. However, now the average went beyond a 60 seconds threshold.

Up until now I occasionally went into the PHP code and tried to improve it where possible, because I thought the main issue is in not so clean code snippets. While it helped a bit, the average time grew further after several days and lately I started to realize there must be another time consuming issue. So I made a test PHP script that performes some kind of logging while running the main code.

It shows that some SQL queries I do to insert the data in my database tables take a lot of time on average (I analyzed 100 matches here):

  • starting squad in DB: 6.44 seconds
  • benched substitutes in DB: 8.49 seconds

Checking the queries again I realized they are quite complex.

Those are the tables involved:

tblStartingSquad

+----+---------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+------------+------------+--------+  
| id | matchID | player1ID | player2ID | player3ID | player4ID | player5ID | player6ID | player7ID | player8ID | player9ID | player10ID | player11ID | clubID |
+----+---------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+------------+------------+--------+
| 1  |    1    |     1     |     2     |     3     |     4     |     5     |     6     |     7     |     8     |     9     |     10     |     11     |   1    |
| 2  |    1    |    12     |    13     |    14     |    15     |    16     |    17     |    16     |    17     |    18     |     19     |     20     |   2    |
| 3  |    2    |     1     |     2     |     3     |     4     |     5     |     6     |     7     |     8     |     9     |     10     |     11     |   1    |
| 4  |    2    |    21     |    22     |    23     |    24     |    25     |    26     |    27     |    28     |    29     |     30     |     31     |   3    |
+----+---------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+------------+------------+--------+

tblSubstitutes

+----+---------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+--------+  
| id | matchID | player12ID | player13ID | player14ID | player15ID | player16ID | player17ID | player18ID | player19ID | player20ID | player21ID | player22ID | player23ID | clubID |
+----+---------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+--------+
| 1  |    1    |     32     |     33     |     34     |     35     |     36     |     37     |     38     |     39     |     40     |     41     |     42     |     43     |   1    |
| 2  |    1    |     44     |     45     |     46     |     47     |     48     |     49     |     50     |     51     |     52     |     53     |     54     |     55     |   2    |
| 3  |    2    |     32     |     33     |     34     |     35     |     36     |     37     |     38     |     39     |     40     |     41     |     42     |     43     |   1    |
| 4  |    2    |     56     |     57     |     58     |     59     |     60     |     61     |     61     |     62     |     63     |     64     |     65     |     66     |   3    |
+----+---------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+--------+

tblMatch

+---------+---------------------+-------------+------------------+
| matchID |         date        |    coach1   |      coach2      |
+---------+---------------------+-------------+------------------+
|    1    | 2006-08-19 22:00:00 | Piotr Nowak | Fernando Clavijo |
|    2    | 2006-08-15 21:00:00 | Piotr Nowak |   Mustafa Ugur   |
+---------+---------------------+-------------+------------------+

tblPlayer

+----------+------------------------+------------------+
| playerID |       namePlayer       |      short       |
+----------+------------------------+------------------+
|     1    |       Enis Ulusan      |    enis-ulusan   |
|     2    |   Grant Robert Murray  |   grant-murray   |
|     3    |     Evgeniy Shpedt     |  evgeniy-shpedt  |
|     4    | Mihai Alexandru Costea |   mihai-costea   |
|     5    |       Jan Zolna        |     jan-zolna    |
|     6    |    Adrian Gheorghiu    | adrian-gheorghiu |
|     7    | Marius Marian Croitoru | marius-croitoru  |
|     8    |    Jacov Nachtailer    | jacov-nachtailer |
|    ...   |          ...           |        ...       |
+----------+------------------------+------------------+

tblClub

+--------+-----------------+
| clubID |    nameClub     |
+--------+-----------------+
|    1   |   D.C. United   |
|    2   | Colorado Rapids |
|    3   | Caykur Rizespor |
+--------+-----------------+

And those are the queries involved:

SQL Query starting squad

$tblstarting_squad = 'INSERT INTO tblStartingSquad (matchID, player1ID, player2ID, player3ID, player4ID, player5ID, player6ID, player7ID, player8ID, player9ID, player10ID, player11ID, clubID) 
                    SELECT
                        (SELECT matchID FROM tblMatch WHERE date = "' . $match_date . '" AND coach1 = "' . $match_coach_home . '" AND coach2 = "' . $match_coach_away . '"), 
                        (SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[0] . '" AND short = "' . $player_short[0] . '" LIMIT 1), 
                        (SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[1] . '" AND short = "' . $player_short[1] . '" LIMIT 1), 
                        (SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[2] . '" AND short = "' . $player_short[2] . '" LIMIT 1), 
                        (SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[3] . '" AND short = "' . $player_short[3] . '" LIMIT 1), 
                        (SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[4] . '" AND short = "' . $player_short[4] . '" LIMIT 1), 
                        (SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[5] . '" AND short = "' . $player_short[5] . '" LIMIT 1), 
                        (SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[6] . '" AND short = "' . $player_short[6] . '" LIMIT 1), 
                        (SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[7] . '" AND short = "' . $player_short[7] . '" LIMIT 1), 
                        (SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[8] . '" AND short = "' . $player_short[8] . '" LIMIT 1), 
                        (SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[9] . '" AND short = "' . $player_short[9] . '" LIMIT 1), 
                        (SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[10] . '" AND short = "' . $player_short[10] . '" LIMIT 1), 
                        (SELECT clubID FROM tblClub WHERE nameClub = "' . $match_club[1] . '" LIMIT 1)
                    WHERE NOT EXISTS (
                        SELECT e.matchID 
                        FROM tblStartingSquad As e
                        INNER JOIN tblMatch As m
                            ON e.matchID = m.matchID
                        WHERE m.date = "' . $match_date . '" AND m.coach1 = "' . $match_coach_home . '" AND m.coach2 = "' . $match_coach_away . '" AND e.clubID = (SELECT clubID FROM tblClub WHERE nameClub = "' . $match_club[1] . '")
                    );';

if (!mysqli_query($db_connection, $tblstarting_squad)) {
                    echo("Error description $tblstarting_squad: " . mysqli_error($db_connection) . "<br />");
                }

SQL Query benched substitutes

$tblsubstitutes = 'INSERT INTO tblSubstitutes (matchID, player12ID, player13ID, player14ID, player15ID, player16ID, player17ID, player18ID, player19ID, player20ID, player21ID, player22ID, player23ID, clubID) 
                    SELECT
                        (SELECT matchID FROM tblMatch WHERE date = "' . $match_date . '" AND coach1 = "' . $match_coach_home . '" AND coach2 = "' . $match_coach_away . '"), 
                        (SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[11] . '" AND short = "' . $player_short[11] . '" LIMIT 1), 
                        (SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[12] . '" AND short = "' . $player_short[12] . '" LIMIT 1), 
                        (SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[13] . '" AND short = "' . $player_short[13] . '" LIMIT 1), 
                        (SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[14] . '" AND short = "' . $player_short[14] . '" LIMIT 1), 
                        (SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[15] . '" AND short = "' . $player_short[15] . '" LIMIT 1), 
                        (SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[16] . '" AND short = "' . $player_short[16] . '" LIMIT 1), 
                        (SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[17] . '" AND short = "' . $player_short[17] . '" LIMIT 1), 
                        (SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[18] . '" AND short = "' . $player_short[18] . '" LIMIT 1), 
                        (SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[19] . '" AND short = "' . $player_short[19] . '" LIMIT 1), 
                        (SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[20] . '" AND short = "' . $player_short[20] . '" LIMIT 1), 
                        (SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[21] . '" AND short = "' . $player_short[21] . '" LIMIT 1),
                        (SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[22] . '" AND short = "' . $player_short[22] . '" LIMIT 1), 
                        (SELECT clubID FROM tblClub WHERE nameClub = "' . $match_club[1] . '" LIMIT 1)
                    WHERE NOT EXISTS (
                        SELECT e.matchID 
                        FROM tblSubstitutes As e
                        INNER JOIN tblMatch As m
                            ON e.matchID = m.matchID
                        WHERE m.date = "' . $match_date . '" AND m.coach1 = "' . $match_coach_home . '" AND m.coach2 = "' . $match_coach_away . '" AND e.clubID = (SELECT clubID FROM tblClub WHERE nameClub = "' . $match_club[1] . '")
                    );';

if (!mysqli_query($db_connection, $tblsubstitutes)) {
                    echo("Error description $tblsubstitutes: " . mysqli_error($db_connection) . "<br />");
                }

Both queries are practically identical. They insert the playerID of 11 (respectively 12) players into the tblStartingSquad (respectively tblSubstitutes) if there's no other entry with identical data. The playerID has to be checked beforehand in the database as the raw data doesn't have the individual ID. That happenes by selecting it via namePlayer and short from the table tblPlayer.

The tables tblStartingSquad and tblSubstitutes itself contain 110,000 rows (for 55,000 matches) currently, tblPlayer is at 100,000 rows.

I've googled a bit for solutions but wasn't quite able to find anything that would improve the overall speed. What I understand as an issue is that I have to check every player individually, so I get 11 and 12 subqueries. That's not really elegant, but I really don't have an idea how to improve that. Maybe someone out here on StackOverflow has a suggestion?


Solution

  • Reconsider your wide table design for a long table design. Numbered suffixed columns are never ideal data storage. Rows are cheap. Columns are expensive. Joins, aggregation, searching, indexing, etc. are much easier in long format. Otherwise, your queries will be complex as you show with 12 subqueries or even self-joins!

    Interestingly, your tblClub and tblPlayer are in long format but not tblStartingSquad and tblSubstitutes! Simply, remove all extraneous player columns into one where rows indicate different players:

    tblStartingSquad

     ID   MatchID   PlayerID    ClubID  
      1         1          5         1
      2         1          8         1
      3         1          9         1
    ...
    

    tblSubstitutes

     ID   MatchID   PlayerID    ClubID
      1         1          2         1
      2         1         16         1
      3         1          7         1
    ...
    

    tblMatch (with renamed coach columns for clarity)

    ID                    Date      HomeCoach          AwayCoach
     1     2006-08-19 22:00:00    Piotr Nowak   Fernando Clavijo
     2     2006-08-15 21:00:00    Piotr Nowak       Mustafa Ugur
    

    PHP

    From this database design, you can then run a simpler PHP parameterized query call and even easier with PDO, instead of mysqli for binding many parameters from arrays.

    // OPEN CONNECTION
    $dbconn = new PDO("mysql:host=$servername;dbname=$dbname", $username, $password);
    // SET PDO ERROR MODE TO EXCEPTION
    $dbconn -> setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    
    // PREPARED STATEMENT
    $sql = "INSERT INTO tblStartingSquad (`Match`, `PlayerID`, `ClubID`)
            SELECT m.MatchID, p1.PlayerID, c.ClubID
            FROM 
               (SELECT p.PlayerID
                FROM tblPlayer p
                WHERE p.namePlayer IN (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
               ) p1
            INNER JOIN 
               (SELECT p.PlayerID
                FROM tblPlayer p
                WHERE p.short IN (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
               ) p2 ON p1.PlayerID = p2.PlayerID
            CROSS JOIN
                (SELECT MatchID 
                 FROM tblMatch
                 WHERE `date` = ? AND HomeCoach = ? AND AwayCoach = ?) m
            CROSS JOIN
                (SELECT ClubID 
                 FROM tblClub
                 WHERE nameClub = ?) c
            WHERE NOT EXISTS
                 (SELECT 1 FROM tblStartingSquad As e
                  WHERE e.MatchID = m.matchID)"
    
    try {
         // INITIALIZE STATEMENT
         $stmt = $dbconn->prepare($sql);
    
         $params = array($player_name[0], $player_name[1], $player_name[2], 
                         $player_name[3], $player_name[4], $player_name[5], 
                         $player_name[6], $player_name[7], $player_name[8], 
                         $player_name[9], $player_name[10],  
                         $player_short[0], $player_short[1], $player_short[2],
                         $player_short[3], $player_short[4], $player_short[5], 
                         $player_short[6], $player_short[7], $player_short[8], 
                         $player_short[9], $player_short[10],         
                         $match_date, $match_coach_home, $match_coach_away, $match_club);
    
         // ITERATIVELY BIND PARAMS
         foreach($params as $key => $val) {
            $stmt->bindParam($key+1, $val, PDO::PARAM_STR);
         }
    
         // EXECUTE ACTION
         $stmt->execute();
    
    } catch (PDOException $e) {
         echo "Error: " . $e->getMessage();
    }
    

    Do a similar call for tblSubstitutes, adjusting append query's destination and WHERE clause and parameter values.