Since a few month by now I am collecting data of football matches from open sources. To do so I grab the data of a specific website URL, that presents the data for a specific football match, with PHP via XPath. Then I do some data editing so they fit the way I need them. The next and last step is to transfer them into my MySQL database in several tables.
With a fast growing database I slowly run into severe performance issues. Because I do everything locally on my computer and that one isn't a monster of machine, it already takes some time to process a match. To get a feeling how fast that is: In the first days of datamining a match took around 24 seconds. However, now the average went beyond a 60 seconds threshold.
Up until now I occasionally went into the PHP code and tried to improve it where possible, because I thought the main issue is in not so clean code snippets. While it helped a bit, the average time grew further after several days and lately I started to realize there must be another time consuming issue. So I made a test PHP script that performes some kind of logging while running the main code.
It shows that some SQL queries I do to insert the data in my database tables take a lot of time on average (I analyzed 100 matches here):
Checking the queries again I realized they are quite complex.
Those are the tables involved:
tblStartingSquad
+----+---------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+------------+------------+--------+
| id | matchID | player1ID | player2ID | player3ID | player4ID | player5ID | player6ID | player7ID | player8ID | player9ID | player10ID | player11ID | clubID |
+----+---------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+------------+------------+--------+
| 1 | 1 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 1 |
| 2 | 1 | 12 | 13 | 14 | 15 | 16 | 17 | 16 | 17 | 18 | 19 | 20 | 2 |
| 3 | 2 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 1 |
| 4 | 2 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 3 |
+----+---------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+------------+------------+--------+
tblSubstitutes
+----+---------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+--------+
| id | matchID | player12ID | player13ID | player14ID | player15ID | player16ID | player17ID | player18ID | player19ID | player20ID | player21ID | player22ID | player23ID | clubID |
+----+---------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+--------+
| 1 | 1 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 1 |
| 2 | 1 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 2 |
| 3 | 2 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 1 |
| 4 | 2 | 56 | 57 | 58 | 59 | 60 | 61 | 61 | 62 | 63 | 64 | 65 | 66 | 3 |
+----+---------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+--------+
tblMatch
+---------+---------------------+-------------+------------------+
| matchID | date | coach1 | coach2 |
+---------+---------------------+-------------+------------------+
| 1 | 2006-08-19 22:00:00 | Piotr Nowak | Fernando Clavijo |
| 2 | 2006-08-15 21:00:00 | Piotr Nowak | Mustafa Ugur |
+---------+---------------------+-------------+------------------+
tblPlayer
+----------+------------------------+------------------+
| playerID | namePlayer | short |
+----------+------------------------+------------------+
| 1 | Enis Ulusan | enis-ulusan |
| 2 | Grant Robert Murray | grant-murray |
| 3 | Evgeniy Shpedt | evgeniy-shpedt |
| 4 | Mihai Alexandru Costea | mihai-costea |
| 5 | Jan Zolna | jan-zolna |
| 6 | Adrian Gheorghiu | adrian-gheorghiu |
| 7 | Marius Marian Croitoru | marius-croitoru |
| 8 | Jacov Nachtailer | jacov-nachtailer |
| ... | ... | ... |
+----------+------------------------+------------------+
tblClub
+--------+-----------------+
| clubID | nameClub |
+--------+-----------------+
| 1 | D.C. United |
| 2 | Colorado Rapids |
| 3 | Caykur Rizespor |
+--------+-----------------+
And those are the queries involved:
SQL Query starting squad
$tblstarting_squad = 'INSERT INTO tblStartingSquad (matchID, player1ID, player2ID, player3ID, player4ID, player5ID, player6ID, player7ID, player8ID, player9ID, player10ID, player11ID, clubID)
SELECT
(SELECT matchID FROM tblMatch WHERE date = "' . $match_date . '" AND coach1 = "' . $match_coach_home . '" AND coach2 = "' . $match_coach_away . '"),
(SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[0] . '" AND short = "' . $player_short[0] . '" LIMIT 1),
(SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[1] . '" AND short = "' . $player_short[1] . '" LIMIT 1),
(SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[2] . '" AND short = "' . $player_short[2] . '" LIMIT 1),
(SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[3] . '" AND short = "' . $player_short[3] . '" LIMIT 1),
(SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[4] . '" AND short = "' . $player_short[4] . '" LIMIT 1),
(SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[5] . '" AND short = "' . $player_short[5] . '" LIMIT 1),
(SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[6] . '" AND short = "' . $player_short[6] . '" LIMIT 1),
(SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[7] . '" AND short = "' . $player_short[7] . '" LIMIT 1),
(SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[8] . '" AND short = "' . $player_short[8] . '" LIMIT 1),
(SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[9] . '" AND short = "' . $player_short[9] . '" LIMIT 1),
(SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[10] . '" AND short = "' . $player_short[10] . '" LIMIT 1),
(SELECT clubID FROM tblClub WHERE nameClub = "' . $match_club[1] . '" LIMIT 1)
WHERE NOT EXISTS (
SELECT e.matchID
FROM tblStartingSquad As e
INNER JOIN tblMatch As m
ON e.matchID = m.matchID
WHERE m.date = "' . $match_date . '" AND m.coach1 = "' . $match_coach_home . '" AND m.coach2 = "' . $match_coach_away . '" AND e.clubID = (SELECT clubID FROM tblClub WHERE nameClub = "' . $match_club[1] . '")
);';
if (!mysqli_query($db_connection, $tblstarting_squad)) {
echo("Error description $tblstarting_squad: " . mysqli_error($db_connection) . "<br />");
}
SQL Query benched substitutes
$tblsubstitutes = 'INSERT INTO tblSubstitutes (matchID, player12ID, player13ID, player14ID, player15ID, player16ID, player17ID, player18ID, player19ID, player20ID, player21ID, player22ID, player23ID, clubID)
SELECT
(SELECT matchID FROM tblMatch WHERE date = "' . $match_date . '" AND coach1 = "' . $match_coach_home . '" AND coach2 = "' . $match_coach_away . '"),
(SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[11] . '" AND short = "' . $player_short[11] . '" LIMIT 1),
(SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[12] . '" AND short = "' . $player_short[12] . '" LIMIT 1),
(SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[13] . '" AND short = "' . $player_short[13] . '" LIMIT 1),
(SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[14] . '" AND short = "' . $player_short[14] . '" LIMIT 1),
(SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[15] . '" AND short = "' . $player_short[15] . '" LIMIT 1),
(SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[16] . '" AND short = "' . $player_short[16] . '" LIMIT 1),
(SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[17] . '" AND short = "' . $player_short[17] . '" LIMIT 1),
(SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[18] . '" AND short = "' . $player_short[18] . '" LIMIT 1),
(SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[19] . '" AND short = "' . $player_short[19] . '" LIMIT 1),
(SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[20] . '" AND short = "' . $player_short[20] . '" LIMIT 1),
(SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[21] . '" AND short = "' . $player_short[21] . '" LIMIT 1),
(SELECT playerID FROM tblPlayer WHERE namePlayer = "' . $player_name[22] . '" AND short = "' . $player_short[22] . '" LIMIT 1),
(SELECT clubID FROM tblClub WHERE nameClub = "' . $match_club[1] . '" LIMIT 1)
WHERE NOT EXISTS (
SELECT e.matchID
FROM tblSubstitutes As e
INNER JOIN tblMatch As m
ON e.matchID = m.matchID
WHERE m.date = "' . $match_date . '" AND m.coach1 = "' . $match_coach_home . '" AND m.coach2 = "' . $match_coach_away . '" AND e.clubID = (SELECT clubID FROM tblClub WHERE nameClub = "' . $match_club[1] . '")
);';
if (!mysqli_query($db_connection, $tblsubstitutes)) {
echo("Error description $tblsubstitutes: " . mysqli_error($db_connection) . "<br />");
}
Both queries are practically identical. They insert the playerID
of 11 (respectively 12) players into the tblStartingSquad
(respectively tblSubstitutes
) if there's no other entry with identical data. The playerID
has to be checked beforehand in the database as the raw data doesn't have the individual ID. That happenes by selecting it via namePlayer
and short
from the table tblPlayer
.
The tables tblStartingSquad
and tblSubstitutes
itself contain 110,000 rows (for 55,000 matches) currently, tblPlayer
is at 100,000 rows.
I've googled a bit for solutions but wasn't quite able to find anything that would improve the overall speed. What I understand as an issue is that I have to check every player individually, so I get 11 and 12 subqueries. That's not really elegant, but I really don't have an idea how to improve that. Maybe someone out here on StackOverflow has a suggestion?
Reconsider your wide table design for a long table design. Numbered suffixed columns are never ideal data storage. Rows are cheap. Columns are expensive. Joins, aggregation, searching, indexing, etc. are much easier in long format. Otherwise, your queries will be complex as you show with 12 subqueries or even self-joins!
Interestingly, your tblClub and tblPlayer are in long format but not tblStartingSquad and tblSubstitutes! Simply, remove all extraneous player columns into one where rows indicate different players:
tblStartingSquad
ID MatchID PlayerID ClubID
1 1 5 1
2 1 8 1
3 1 9 1
...
tblSubstitutes
ID MatchID PlayerID ClubID
1 1 2 1
2 1 16 1
3 1 7 1
...
tblMatch (with renamed coach columns for clarity)
ID Date HomeCoach AwayCoach
1 2006-08-19 22:00:00 Piotr Nowak Fernando Clavijo
2 2006-08-15 21:00:00 Piotr Nowak Mustafa Ugur
PHP
From this database design, you can then run a simpler PHP parameterized query call and even easier with PDO, instead of mysqli
for binding many parameters from arrays.
// OPEN CONNECTION
$dbconn = new PDO("mysql:host=$servername;dbname=$dbname", $username, $password);
// SET PDO ERROR MODE TO EXCEPTION
$dbconn -> setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
// PREPARED STATEMENT
$sql = "INSERT INTO tblStartingSquad (`Match`, `PlayerID`, `ClubID`)
SELECT m.MatchID, p1.PlayerID, c.ClubID
FROM
(SELECT p.PlayerID
FROM tblPlayer p
WHERE p.namePlayer IN (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
) p1
INNER JOIN
(SELECT p.PlayerID
FROM tblPlayer p
WHERE p.short IN (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
) p2 ON p1.PlayerID = p2.PlayerID
CROSS JOIN
(SELECT MatchID
FROM tblMatch
WHERE `date` = ? AND HomeCoach = ? AND AwayCoach = ?) m
CROSS JOIN
(SELECT ClubID
FROM tblClub
WHERE nameClub = ?) c
WHERE NOT EXISTS
(SELECT 1 FROM tblStartingSquad As e
WHERE e.MatchID = m.matchID)"
try {
// INITIALIZE STATEMENT
$stmt = $dbconn->prepare($sql);
$params = array($player_name[0], $player_name[1], $player_name[2],
$player_name[3], $player_name[4], $player_name[5],
$player_name[6], $player_name[7], $player_name[8],
$player_name[9], $player_name[10],
$player_short[0], $player_short[1], $player_short[2],
$player_short[3], $player_short[4], $player_short[5],
$player_short[6], $player_short[7], $player_short[8],
$player_short[9], $player_short[10],
$match_date, $match_coach_home, $match_coach_away, $match_club);
// ITERATIVELY BIND PARAMS
foreach($params as $key => $val) {
$stmt->bindParam($key+1, $val, PDO::PARAM_STR);
}
// EXECUTE ACTION
$stmt->execute();
} catch (PDOException $e) {
echo "Error: " . $e->getMessage();
}
Do a similar call for tblSubstitutes, adjusting append query's destination and WHERE
clause and parameter values.