I have a MYSQL InnoDB table table
with the following columns (table and column names changed):
where rel_ab
is a column that describes a relationship between 2 variables var_a
and var_b
at a given date. (var_a
and var_b
reference different tables)
The data is uploaded in daily batches, totaling around 7 million rows per day. The problem is, after only a few weeks, it is starting to take many hours to upload each new daily batch. Clearly we need to improve our table design. Here are some additional details of our table.
COMPRESSION="zlib"
.var_a
and var_b
, necessitated by the foreign keys.SELECT * FROM table WHERE date = <date>
for a given date <date>
. Selecting only takes a couple minutes.var_a
and var_b
refer to.df.to_sql('temp', con, if_exists='replace', index=False, method='multi')
, where we insert ignore from temp
to table
and then drop temp
.As a result, I was planning on doing at least one of the following:
var_a
and var_b
and relying on the data uploading process to do everything correctly. This is because in our use-case, neither index actually speeds up the query speeds.table_230501
with columns var_a
, var_b
, rel_ab
. This is because we only ever select from one date at a time.I know that the first solution could threaten data integrity, while the second solution would clutter our schema. In my limited experience, I've also never heard of the second option being done and couldn't find any example of this design online. Are either of these options a wise solution? Both will improve upload speed and decrease disk usage but both have their cons as well. Otherwise, what are other ways that I can improve upload speed?
Edit: My SHOW CREATE TABLE
should look like
CREATE TABLE table (
date date NOT NULL,
var_a int NOT NULL,
var_b int NOT NULL,
rel_ab decimal(19,16) NOT NULL,
PRIMARY KEY (date,`var_a`,`var_b`),
KEY a_idx (var_a),
KEY b_idx (var_b),
CONSTRAINT a FOREIGN KEY (var_a) REFERENCES other_table_a (var_a) ON DELETE RESTRICT ON UPDATE CASCADE,
CONSTRAINT b FOREIGN KEY (var_b) REFERENCES other_table_b (var_b) ON DELETE RESTRICT ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMPRESSION="zlib"
To speed up the upload, get rid of it. Seriously, why put the data in a table if the only thing you do is fetch exactly what was in one date's file? (Your Comment points out that the single file is really a few files. Combining them first is probably a good idea.)
If you do need the data in a table, let's discuss these...
SHOW CREATE TABLE
; what you provided may have left out some subtle items.LOAD DATA
? Hopefully not one row INSERTed
at a time. I don't know how Pandas works. (Nor how the 99 other packages that 'simplify' MySQL access work.) Please find out what it does under the covers. You may have to bypass Pandas to get better performance. Bulk loading is at least 10 times as fast as row-by-row.FLOAT
?INT
to reference them. Switching to a 3-byte MEDIUMINT [UNSIGNED]
would save at least 7MB per day.Multiple 'identical' tables is always unwise. One table is always better. But, as suggested above, zero tables is still better.