We have a single table, about 10 TB in size with 5 billion rows.
create table dummy_table (
    id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    type INT, -- enum-like; only a few fixed values
    create_date DATE,
    user VARCHAR(10),
    A VARCHAR(10),
    B VARCHAR(10),
    C VARCHAR(10),
    D VARCHAR(10)
);
Data is immutable: no update operation is required, only bulk insertion (a daily data load) and bulk deletion (a monthly data cleanup).
Our prime use case is to search by user, then by type, then by create_date. For that, I am comparing two strategies.

STRATEGY 1: a composite index:

create index user_type_creation_date_idx on dummy_table(user, type, create_date)
STRATEGY 2: partitioning plus a single-column index:

alter table dummy_table PARTITION BY LIST(type)
SUBPARTITION BY HASH(YEAR(create_date) + MONTH(create_date)) (
    PARTITION pA VALUES IN (0) (
        SUBPARTITION s0,
        SUBPARTITION s1,
        .....)

create index user_idx on dummy_table(user)
Both strategies give almost the same SELECT execution time. The problem I am facing is bulk insertion. We are trying to pump 30 million rows (4.2 GB) from an S3 file into the table.

Without any index or partitioning, loading that much data takes around 360 seconds. With STRATEGY 2 the load time increases to 850 seconds, and with STRATEGY 1 the load has been running for the past 15,000 seconds and is still going.
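For reference, this is roughly how we load it (the S3 file is copied to local disk first; /tmp/daily_load.csv is a stand-in path and the column order is illustrative):

LOAD DATA INFILE '/tmp/daily_load.csv'
INTO TABLE dummy_table
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(type, create_date, user, A, B, C, D); -- id is filled in by AUTO_INCREMENT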
Dropping the index before each load is not an option, as creating one takes more than 7 hours, and we are planning to add four more composite indexes (see the sketch after this list):
- index on user, A
- index on user, B
- index on user, C
- index on user, D
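If we do add them, building all four in one ALTER TABLE should let InnoDB scan the table once rather than once per index; a sketch, with placeholder index names:

ALTER TABLE dummy_table
    ADD INDEX user_a_idx (user, A),
    ADD INDEX user_b_idx (user, B),
    ADD INDEX user_c_idx (user, C),
    ADD INDEX user_d_idx (user, D);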
My question: how can we reduce the bulk-load time while keeping the indexes we need?

Version: MySQL 5.6
Plan A: Keep the index, skip partitioning, and pre-sort the incoming data by user + type + create_date. This will be the fastest, but it needs room to pre-sort the CSV file.
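If pre-sorting the raw file outside MySQL is awkward, the same effect can be had inside the server; a sketch using a hypothetical staging table (the ORDER BY still needs sort space, just in MySQL's tmpdir instead of yours):

-- Stage the file into an index-free table, then insert in index order.
CREATE TABLE staging (
    type INT,
    create_date DATE,
    user VARCHAR(10),
    A VARCHAR(10), B VARCHAR(10), C VARCHAR(10), D VARCHAR(10)
);
LOAD DATA INFILE '/tmp/daily_load.csv' INTO TABLE staging
    FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
INSERT INTO dummy_table (type, create_date, user, A, B, C, D)
    SELECT type, create_date, user, A, B, C, D
    FROM staging
    ORDER BY user, type, create_date; -- append in index order
DROP TABLE staging;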
Plan B: No index, no partitioning; just load the data, then ADD INDEX. This needs a huge amount of disk space to do the sorting. This plan may be close to as fast as Plan A.
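Assuming InnoDB, 5.6 can at least build that index online; a sketch (the ALGORITHM and LOCK clauses just make the online DDL explicit, and the statement fails fast if they cannot be honored):

ALTER TABLE dummy_table
    ADD INDEX user_type_creation_date_idx (user, type, create_date),
    ALGORITHM=INPLACE, LOCK=NONE;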
Plan C: If you don't need id, and (user, type, create_date) is unique, then remove id and do Plan A. This is the best approach.
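Roughly what Plan C's table would look like (a sketch; it assumes (user, type, create_date) really is unique, which only you can verify):

CREATE TABLE dummy_table (
    type INT,
    create_date DATE,
    user VARCHAR(10),
    A VARCHAR(10),
    B VARCHAR(10),
    C VARCHAR(10),
    D VARCHAR(10),
    PRIMARY KEY (user, type, create_date) -- clustered: rows stored in lookup order
);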
The issues:

- Plan C puts user as the first item in the PK. How many rows per user? (Sounds like millions?)
- Since the monthly cleanup is a bulk delete, use PARTITION BY RANGE(TO_DAYS(...)) together with monthly partitions. This is because DROP PARTITION is immensely faster than DELETE. See this for more details.

So, now I recommend A or C, plus monthly partitions.

If you keep id, then to benefit the common SELECT (and because of the huge data size), do

PRIMARY KEY(user, type, create_date, id), -- clustered, unique, etc
INDEX(id) -- sufficient to keep AUTO_INCREMENT happy
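A sketch of the monthly layout (partition names and boundaries are illustrative; note that the partition key must be part of every unique key, which create_date is here):

ALTER TABLE dummy_table
PARTITION BY RANGE (TO_DAYS(create_date)) (
    PARTITION p2015_01 VALUES LESS THAN (TO_DAYS('2015-02-01')),
    PARTITION p2015_02 VALUES LESS THAN (TO_DAYS('2015-03-01')),
    PARTITION pmax VALUES LESS THAN MAXVALUE
);
-- The monthly cleanup then becomes a near-instant metadata operation:
ALTER TABLE dummy_table DROP PARTITION p2015_01;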
The math puzzles me: 5B rows of about 100 bytes each comes to roughly 500 GB of raw data, so about 1 TB after InnoDB overhead, yet you say 10 TB.
I have assumed InnoDB; if you are using some other engine, I may need to modify my answer.
Thinking further 'out of the box'...
With InnoDB and parallel loads, you should be able to use "transportable tablespaces", but importing a tablespace directly into a partition is not available until 5.7.4. This would involve having a bunch of separate loads going on, each loading into a separate table; when finished, use transportable tablespaces to move each one into the main table as a partition.
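Roughly what that flow looks like once you are on 5.7.4+ (a sketch; table, partition, and file names are placeholders, and the actual file copies happen at the OS level):

-- Load side: fill a standalone, identically defined table, then export it.
FLUSH TABLES chunk_2015_01 FOR EXPORT;
--   (copy chunk_2015_01.ibd / chunk_2015_01.cfg out of the datadir)
UNLOCK TABLES;

-- Main table: swap the files in as one partition.
ALTER TABLE dummy_table DISCARD PARTITION p2015_01 TABLESPACE;
--   (copy the exported files into place under the partition's tablespace name)
ALTER TABLE dummy_table IMPORT PARTITION p2015_01 TABLESPACE;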