
Bulk data import into a MySQL table without dropping indexes


We have a single table of size 10 TB with 5 billion rows.

create table dummy_table (
  id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  type INT, -- acts like an ENUM; only a few fixed values
  create_date DATE,
  user VARCHAR(10),
  A VARCHAR(10),
  B VARCHAR(10),
  C VARCHAR(10),
  D VARCHAR(10)
);

Data is immutable: no UPDATE operation is ever required, only bulk insertion (daily data loads) and bulk deletion (monthly data cleanup).

Our primary use case is to search by user, then type, then create_date. For that, I am evaluating two strategies:

  1. STRATEGY1: via a composite index:

create index user_type_creation_date_idx on dummy_table(user, type, create_date);

  2. STRATEGY2: via partitioning plus an index:

alter table dummy_table
  PARTITION BY LIST(type)
  SUBPARTITION BY HASH(YEAR(create_date) + MONTH(create_date)) (
    PARTITION pA VALUES IN (0) (SUBPARTITION s0, SUBPARTITION s1, .....)
  );

create index user_idx on dummy_table(user);
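
For reference, the query we are comparing looks roughly like this (the literal values are just placeholders):

SELECT *
FROM dummy_table
WHERE user = 'u123'
  AND type = 2
  AND create_date BETWEEN '2015-06-01' AND '2015-06-30';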

Both strategies give almost the same SELECT execution time. The problem I am facing is with bulk insertion: we are trying to load 30 million rows (4.2 GB) from an S3 file into the table.
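
The load itself uses Aurora's LOAD DATA FROM S3, roughly like this (the bucket, path, and field options are placeholders):

LOAD DATA FROM S3 's3://our-bucket/daily/2015-06-01.csv'
INTO TABLE dummy_table
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(type, create_date, user, A, B, C, D);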

Without any index or partitioning, it takes around 360 seconds to load that much data. With STRATEGY2, the load time increases to 850 seconds; with STRATEGY1, the load has been running for the past 15,000 seconds and is still going.

Dropping the index before each load is not an option, as creating one takes more than 7 hours, and we are planning to add four more composite indexes:

- index on user, A
- index on user, B
- index on user, C
- index on user, D
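
In DDL terms (the index names here are just placeholders), those would be:

create index user_a_idx on dummy_table(user, A);
create index user_b_idx on dummy_table(user, B);
create index user_c_idx on dummy_table(user, C);
create index user_d_idx on dummy_table(user, D);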

Here are my questions:

  • In many SO posts I have read NOT to use partitioning, yet partitioning seems to perform well for bulk insertion here. Am I doing something wrong with STRATEGY1?
  • Is there any way to speed up bulk insertion (partitioning, sub-partitioning, some MySQL/InnoDB setting), given that we have the advantage of a single table (no joins) on which no UPDATE is ever required?
  • Is there any way to increase bulk insertion speed by loading multiple files in parallel? LOAD DATA FROM S3 blocks other such commands. Can we run them in parallel somehow?

Version: MySQL 5.6


Solution

  • Plan A: Keep the index, skip partitioning, and pre-sort the incoming data by user + type + create_date. This will be the fastest, but it needs room to pre-sort the CSV file.

    Plan B: No index, no partitioning; just load the data, then ADD INDEX. This needs a huge amount of disk space to do the sorting. This plan may be nearly as fast as Plan A.

    Plan C: If you don't need id and (user, type, create_date) is unique, then remove id and do Plan A. This is the best approach; a sketch follows.
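
    A minimal sketch of Plan C, assuming (user, type, create_date) is unique and the incoming CSV has already been sorted on those columns:

    CREATE TABLE dummy_table (
      type INT,
      create_date DATE,
      user VARCHAR(10),
      A VARCHAR(10),
      B VARCHAR(10),
      C VARCHAR(10),
      D VARCHAR(10),
      PRIMARY KEY (user, type, create_date)   -- clustered; rows arrive in PK order
    ) ENGINE=InnoDB;

    With rows arriving already in primary-key order, InnoDB appends to the end of the clustered index instead of doing random inserts all over the B-tree.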

    The issues:

    • Partitioning provides no performance benefit for your situation.
    • Inserting the rows in order or building the index by sorting -- either is much faster than inserting rows randomly.
    • More indexes: (user + A/B/C/D) -- this further necessitates having user as the first item in the PK. How many rows per user? (Sounds like millions?)
    • You mentioned monthly deletion. That really calls for PARTITION BY RANGE(TO_DAYS(...)) with monthly partitions, because DROP PARTITION is immensely faster than DELETE. See this for more details. So now I recommend Plan A or C, plus monthly partitions (a sketch follows the code block below).
    • If you must have id, then to benefit the common SELECT (and because of the huge data size), do:

    PRIMARY KEY(user, type, create_date, id), -- clustered, unique, etc
    INDEX(id)   -- sufficient to keep AUTO_INCREMENT happy
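
    Putting the monthly-partition and primary-key suggestions together, a sketch (partition names and date ranges are purely illustrative):

    CREATE TABLE dummy_table (
      id BIGINT UNSIGNED AUTO_INCREMENT,
      type INT,
      create_date DATE,
      user VARCHAR(10),
      A VARCHAR(10),
      B VARCHAR(10),
      C VARCHAR(10),
      D VARCHAR(10),
      PRIMARY KEY (user, type, create_date, id),  -- clustered; matches the common SELECT
      INDEX (id)                                  -- sufficient to keep AUTO_INCREMENT happy
    ) ENGINE=InnoDB
    PARTITION BY RANGE (TO_DAYS(create_date)) (
      PARTITION p2015_06 VALUES LESS THAN (TO_DAYS('2015-07-01')),
      PARTITION p2015_07 VALUES LESS THAN (TO_DAYS('2015-08-01')),
      PARTITION pfuture  VALUES LESS THAN MAXVALUE
    );

    -- Monthly cleanup becomes a quick metadata operation instead of a huge DELETE:
    ALTER TABLE dummy_table DROP PARTITION p2015_06;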
    

    The math puzzles me: 5B rows of about 100 bytes each would be about 1TB (after InnoDB overhead), yet you say 10TB.

    I have assumed InnoDB; if you are using some other engine, I may need to modify my answer.

    Thinking further 'out of the box'...

    With InnoDB and parallel loads, you should be able to do the following with "transportable tablespaces". But it is not available for partitions until 5.7.4.

    This would involve having a bunch of separate loads going on, each loading into a separate table. When finished, use "transportable tablespaces" to move each one into the main table as a partition.
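
    For reference, the core transportable-tablespace commands look like this (a sketch with placeholder table names; the per-partition variants, ALTER TABLE ... DISCARD/IMPORT PARTITION ... TABLESPACE, are the part that requires 5.7.4+):

    -- On the server where a chunk was loaded into its own table:
    FLUSH TABLES load_chunk_01 FOR EXPORT;   -- quiesces the table and writes a .cfg metadata file
    -- ... copy load_chunk_01.ibd and load_chunk_01.cfg out of the data directory ...
    UNLOCK TABLES;

    -- On the destination, for a table with the identical definition:
    ALTER TABLE load_chunk_01 DISCARD TABLESPACE;
    -- ... place the copied .ibd (and .cfg) files into the destination data directory ...
    ALTER TABLE load_chunk_01 IMPORT TABLESPACE;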