
Most efficient way to store VARCHAR in 150B-row table


We have to ingest and store 150 billion records in our MySQL InnoDB database. One field in particular, a VARCHAR, is taking up a lot of space. Its characteristics:

  • Can be NULL
  • Highly duplicated but we can't de-dupe because it increases ingestion time exponentially
  • Average length is about 75 characters
  • It has to have an index as it will have to join with another table
  • We don't need to store it in a human-readable format, but we need to be able to match it to another table, which would have to have the same format for this column

I've tried the following:

  • Compressing the table; this helps with space but dramatically increases ingestion time, so I'm not sure compression is going to work for us
  • Tried hashing with SHA2, which reduced the string length to 56; that gives us a reasonable space saving but just not quite enough. Also, I'm not sure SHA2 will generate unique values for this sort of data
  • Was thinking about MD5, which would further reduce the string length to probably the right level, but again I'm not sure whether MD5 is strong enough to generate unique values that can be matched against another table
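
For reference, the digest lengths involved (the input string here is made up):

    SELECT LENGTH(SHA2('example-router-address-string', 224));  -- 56 hex characters
    SELECT LENGTH(SHA2('example-router-address-string', 256));  -- 64 hex characters
    SELECT LENGTH(MD5('example-router-address-string'));        -- 32 hex characters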

Solution

  • Is the VARCHAR ordinary text? Text like that typically compresses about 3:1. Compressing just the one field may get it down to 25-30 bytes. Then use something like VARBINARY(99).
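
    A minimal sketch of that idea, using MySQL's built-in COMPRESS()/UNCOMPRESS() (client-side zlib would work the same way; table and column names are made up, and the actual ratio on ~75-byte strings needs to be measured):

        CREATE TABLE records (
            id        BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
            addr_zip  VARBINARY(99) NULL,   -- compressed form of the ~75-character string
            PRIMARY KEY (id)
        ) ENGINE=InnoDB;

        -- Compress on the way in, uncompress on the way out:
        INSERT INTO records (addr_zip)
            VALUES (COMPRESS('some 75-character router address goes here'));
        SELECT UNCOMPRESS(addr_zip) FROM records WHERE id = 1;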

    INT (4 bytes) is not big enough for normalizing 15 billion distinct values, so you need something bigger. BIGINT takes 8 bytes. BINARY(5) and DECIMAL(11,0) are 5 bytes each, but are messier to deal with.
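
    If you normalize instead, the dictionary side could look something like this (names are illustrative; INT UNSIGNED tops out around 4.3 billion, hence the wider id):

        CREATE TABLE router_addresses (
            addr_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,   -- 8 bytes; BINARY(5) or DECIMAL(11,0) would be 5
            addr    VARCHAR(255) CHARACTER SET ascii NOT NULL, -- adjust length/charset to the real data
            PRIMARY KEY (addr_id),
            UNIQUE KEY (addr)     -- needed to find or assign the id during ingestion
        ) ENGINE=InnoDB;

        -- The 150B-row table then stores addr_id instead of the ~75-byte string.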

    But you are concerned about the normalization speed. I would be more concerned about the ingestion speed, especially if you need to index this column!

    How long does it take to build the table? You haven't said what the schema is; I'll guess that you can put 100 rows in an InnoDB block. I'll say you are using SSDs and can get 10K IOPS. 150B rows / 100 rows per block = 1.5B blocks; 1.5B blocks / 10K blocks/sec = 150K seconds, which is about 2 days. This assumes no index other than an ordered PRIMARY KEY. (If it is not ordered, then you will be jumping around the table and will need a lot more IOPS; change the estimate to 6 months.)

    The index on the column will effectively be a table of 150 billion 'rows' -- it will take several terabytes just for the index BTree. You can either index the field as you insert the rows, or you can build the index later.

    • Building the index as you insert, even with the benefit of InnoDB's "change buffer", will eventually slow down to not much faster than 1 disk hit per row inserted. Are you using SSDs? (Spinning drives are rated at about 10ms/hit.) Let's say you can get 10K hits (inserts) per second. 150B rows / 10K inserts/sec works out to 15M seconds, which is about 6 months.
    • Building the index after loading the entire table... This effectively builds a file with 150 billion lines, sorts it, then constructs the index in order. This may take a week, not months. But... It will require enough disk space for a second copy of the table (probably more) during the index-building.
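
    A sketch of the deferred build, using the records table from the earlier sketch (the online-DDL clauses need MySQL 5.6+ / MariaDB 10.0+):

        -- Load the table with only its ordered PRIMARY KEY first, then build
        -- the secondary index in one sorted pass. This needs temporary disk
        -- space on the order of the finished index (or more).
        ALTER TABLE records
            ADD INDEX idx_addr_zip (addr_zip),
            ALGORITHM = INPLACE, LOCK = NONE;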

    So, maybe we can do the normalization in a similar way? But wait. You said the column was so big that you can't even get the table loaded? So we have to compress or normalize that column?

    How will the load be done?

    • Multiple LOAD DATA calls (probably best; see the sketch after this list)? Single-row INSERTs (change "2 days" to "2 weeks" at least)? Multi-row INSERTs (100-1000 rows each is good)?
    • autocommit? Short transactions? One huge transaction (this is deadly)? (Recommend 1K-10K rows per COMMIT.)
    • Single threaded (perhaps cannot go fast enough)? Multi-threaded (other issues)?
    • My discussion of high-speed-ingestion.
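
    A minimal sketch of the chunked LOAD DATA route (file path, delimiter, and columns are made up; with autocommit on, each file is its own transaction, so splitting the input into files of a few thousand rows also takes care of the COMMIT sizing):

        -- Repeat (or run a few of these in parallel) for each pre-split chunk file:
        LOAD DATA LOCAL INFILE '/data/chunks/part-000001.tsv'
            INTO TABLE records
            FIELDS TERMINATED BY '\t'
            (id, @addr)
            SET addr_zip = COMPRESS(@addr);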

    Or will the table be MyISAM? The disk footprint will be significantly smaller. Most of my other comments still apply.

    Back to MD5/SHA2. Building the normalization table, assuming it is much bigger than can be cached in RAM, will be a killer, too, no matter how you do it. But, let's get some of the other details ironed out first.

    See also TokuDB (available with newer versions of MariaDB) for good high-speed ingestion and indexing. TokuDB will slow down some for your table size, whereas InnoDB/MyISAM will slow to a crawl, as I already explained. TokuDB also compresses automatically; some say by 10x. I don't have any speed or space estimates, but I see TokuDB as very promising.
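
    Switching engines is just a table option; a sketch, assuming a MariaDB/Percona build that bundles TokuDB (the ROW_FORMAT value selects the compression):

        CREATE TABLE records_toku (
            id   BIGINT UNSIGNED NOT NULL,
            addr VARCHAR(255) NULL,
            PRIMARY KEY (id)
        ) ENGINE=TokuDB ROW_FORMAT=TOKUDB_ZLIB;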

    Plan B

    It seems that the real problem is in compressing or normalizing the 'router address'. To recap: Of the 150 billion rows, there are about 15 billion distinct values, plus a small percentage of NULLs. The strings average 75 bytes. Compressing may be ineffective because of the nature of the strings. So, let's focus on normalizing.

    The id needs to be at least 5 bytes (to handle 15B distinct values); the string averages 75 bytes. (I assume that is bytes, not characters.) That is roughly (5 + 75) bytes × 15B rows ≈ 1.2TB of data; add on some overhead for the BTree, etc., and the total ends up somewhere around 2TB.

    I assume the router addresses are rather random during the load of the table, so the lookup for the 'next' address to insert is a random probe into the ever-growing index BTree. Once the index grows past what can fit in the buffer_pool (less than 768GB), I/O will be needed more and more often. By the end of the load, approximately 3 out of 4 rows inserted will have to wait for a read from that index BTree to check whether the value already exists. We are looking at a load time of months, even with SSDs.

    So, what can be done? Consider the following:

    • Hash the address with MD5 and UNHEX() it - 16 bytes. Leave that in the table.
    • Meanwhile, write a file with the hex value of the MD5 plus the router address - 150B lines (skipping the NULLs).
    • Sort that file on the MD5, with deduplication.
    • Build the normalization table from the sorted file (15B lines).
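
    A sketch of that flow with made-up table, column, and file names (the sort-with-dedup step itself happens outside MySQL, e.g. a command-line sort over the 150B-line file):

        -- Re-sketch of the big table for Plan B: keep only the 16-byte digest;
        -- NULL addresses simply stay NULL.
        CREATE TABLE records (
            id       BIGINT UNSIGNED NOT NULL,
            addr_md5 BINARY(16) NULL,        -- UNHEX(MD5(router_address))
            PRIMARY KEY (id)
        ) ENGINE=InnoDB;

        INSERT INTO records (id, addr_md5)
            VALUES (1, UNHEX(MD5('some 75-character router address goes here')));

        -- After the external sort/dedup of "md5_hex<TAB>address" lines, load the
        -- ~15B distinct pairs into the normalization table; the digest itself is the key:
        CREATE TABLE router_addresses (
            addr_md5 BINARY(16) NOT NULL,
            addr     VARCHAR(255) NOT NULL,
            PRIMARY KEY (addr_md5)
        ) ENGINE=InnoDB;

        LOAD DATA LOCAL INFILE '/data/addr_md5_sorted_dedup.tsv'
            INTO TABLE router_addresses
            FIELDS TERMINATED BY '\t'
            (@md5_hex, addr)
            SET addr_md5 = UNHEX(@md5_hex);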

    Result: The load is reasonably fast (but complex). The router address is not 75 bytes (nor 5 bytes), but 16. The normalization table exists and works.
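
    Matching against the other table (the original requirement) then happens on the 16-byte digest, assuming that table stores its copy of the column the same way:

        -- other_table and its columns are illustrative
        SELECT r.id, o.other_column
        FROM records AS r
        JOIN other_table AS o USING (addr_md5);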