We have to ingest and store 150 billion records in our MySQL InnoDB database. One field in particular, a VARCHAR, is taking up a lot of space. Its characteristics:
I've tried the following:
Is the varchar ordinary text? Ordinary text typically compresses about 3:1, so compressing just the one field may get it down to 25-30 bytes. Then use something like VARBINARY(99).

INT (4 bytes) is not big enough for normalizing 15 billion distinct values, so you need something bigger. BIGINT takes 8 bytes. BINARY(5) and DECIMAL(11,0) are 5 bytes each, but are messier to deal with.
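To make the byte counts concrete, here is a throwaway table definition; the table and column names are purely illustrative, only the storage sizes matter:

```sql
-- Hypothetical column definitions showing the candidate types side by side.
CREATE TABLE size_examples (
    addr_compressed VARBINARY(99),   -- ~25-30 bytes if the text compresses ~3:1
    id_int          INT UNSIGNED,    -- 4 bytes: too small for 15B distinct values
    id_bigint       BIGINT UNSIGNED, -- 8 bytes: big enough, but 3 bytes "wasted" per row
    id_binary5      BINARY(5),       -- 5 bytes: big enough, but application-managed
    id_decimal      DECIMAL(11,0)    -- 5 bytes: big enough, but clumsy to work with
) ENGINE=InnoDB;
```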
But you are concerned about the normalization speed. I would be more concerned about the ingestion speed, especially if you need to index this column!
How long does it take to build the table? You haven't said what the schema is; I'll guess that you can put 100 rows in an InnoDB block. I'll say you are using SSDs and can get 10K IOPs. 1.5B blocks / 10K blocks/sec = 150K seconds = 2 days. This assumes no index other than an ordered PRIMARY KEY. (If it is not ordered, then you will be jumping around the table and will need a lot more IOPs; change the estimate to 6 months.)
The index on the column will effectively be a table of 150 billion 'rows' -- it will take several terabytes just for the index BTree. You can either index the field as you insert the rows, or you can build the index later.
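A minimal sketch of the two approaches, using assumed table and column names (big_table, router_addr):

```sql
-- Hypothetical table; load it with only the clustered PRIMARY KEY in place.
CREATE TABLE big_table (
    id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    router_addr VARBINARY(99) NULL,
    PRIMARY KEY (id)
    -- Alternative: declare INDEX idx_router_addr (router_addr) here instead,
    -- and every insert will also update that secondary BTree as it goes.
) ENGINE=InnoDB;

-- ... bulk load happens here ...

-- Build the secondary index afterwards, in one pass over the loaded data.
ALTER TABLE big_table ADD INDEX idx_router_addr (router_addr);
```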
So, maybe we can do the normalization in a similar way? But wait. You said the column was so big that you can't even get the table loaded? So we have to compress or normalize that column?
How will the load be done?

- LOAD DATA calls (probably best)?
- Single-row INSERTs (change "2 days" to "2 weeks" at least)?
- Multi-row INSERTs (100-1000 rows per statement is good)?
- autocommit? Short transactions? One huge transaction (this is deadly)? (Recommend 1K-10K rows per COMMIT; see the sketch after this list.)
- Or will the table be MyISAM? The disk footprint will be significantly smaller. Most of my other comments still apply.
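A minimal sketch of what batched loading could look like, assuming the hypothetical big_table above, a CSV input file, and made-up column names; the file path and batch sizes are illustrative, not prescriptive:

```sql
-- Bulk load from a file; LOAD DATA is usually the fastest path into MySQL.
LOAD DATA LOCAL INFILE '/data/records.csv'
INTO TABLE big_table
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(router_addr, other_col);

-- If the data arrives as INSERTs instead: batch many rows per statement
-- and commit every few thousand rows, not per row and not all at once.
SET autocommit = 0;
INSERT INTO big_table (router_addr, other_col) VALUES
    ('addr1', 1),
    ('addr2', 2),
    ('addr3', 3);   -- ... 100-1000 rows per INSERT
COMMIT;             -- roughly every 1K-10K rows
```

The point of the batching is to amortize transaction overhead without building one enormous transaction; both extremes (per-row commits and a single 150B-row transaction) hurt.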
Back to MD5/SHA2. Building the normalization table, assuming it is much bigger than can be cached in RAM, will be a killer, too, no matter how you do it. But, let's get some of the other details ironed out first.
See also TokuDB (available with newer versions of MariaDB) for good high-speed ingestion and indexing. TokuDB will slow down some for your table size, whereas InnoDB/MyISAM will slow to a crawl, as I already explained. TokuDB also compresses automatically; some say by 10x. I don't have any speed or space estimates, but I see TokuDB as very promising.
Plan B
It seems that the real problem is in compressing or normalizing the 'router address'. To recap: of the 150 billion rows, there are about 15 billion distinct values, plus a small percentage of NULLs. The strings average 75 bytes. Compressing may be ineffective because of the nature of the strings. So, let's focus on normalizing.
The id needs to be at least 5 bytes (to handle 15B distinct values); the string averages 75 bytes. (I assume that is bytes, not characters.) That is roughly 15B * 80 bytes = 1.2TB of raw data; add on some overhead for BTree, etc, and the total ends up somewhere around 2TB.
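A hedged sketch of what such a normalization table might look like; the table and column names are assumptions, and the VARBINARY(255) width is a guess based on the ~75-byte average:

```sql
-- Hypothetical normalization (dictionary) table for the router addresses.
CREATE TABLE router_addrs (
    addr_id BINARY(5)      NOT NULL,   -- application-assigned 5-byte id
    addr    VARBINARY(255) NOT NULL,   -- averages ~75 bytes
    PRIMARY KEY (addr_id),
    UNIQUE KEY uk_addr (addr)          -- needed to find or insert the 'next' address
) ENGINE=InnoDB;

-- The big table would then carry only the 5-byte id, e.g.:
--     router_addr_id BINARY(5) NULL
```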
I assume the router addresses arrive in rather random order during the load of the table, so the lookup for the 'next' address to insert is a random probe into the ever-growing index BTree. Once the index grows past what can fit in the buffer_pool (less than 768GB), I/O will be needed more and more often. By the end of the load, approximately 3 out of 4 rows inserted will have to wait for a read from that index BTree to check whether the value already exists. We are looking at a load time of months, even with SSDs.
So, what can be done? Consider the following. Hash the address with MD5 and UNHEX it - 16 bytes. Leave that in the table. Meanwhile, write a file with the hex value of the md5 plus the router address - 150B lines (skipping the NULLs). Sort that file on the md5, with deduplication. Build the normalization table from the sorted file (15B lines).
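A minimal sketch of Plan B, under assumed names (big_table, router_addrs, the staging file path); the external sort/dedup step happens outside MySQL and is only described in comments:

```sql
-- 1. Store only the 16-byte binary MD5 of the address in the big table.
CREATE TABLE big_table (
    id       BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    addr_md5 BINARY(16) NULL,          -- UNHEX(MD5(router_addr)); NULL when the address is NULL
    PRIMARY KEY (id)
) ENGINE=InnoDB;

-- While loading, also append "MD5(addr)<tab>addr" lines to a flat file, then
-- sort that file on the md5 with deduplication (e.g. Unix `sort -u`) before step 2.

-- 2. Build the normalization table from the sorted, deduplicated file.
CREATE TABLE router_addrs (
    addr_md5 BINARY(16)     NOT NULL,
    addr     VARBINARY(255) NOT NULL,  -- the original ~75-byte string
    PRIMARY KEY (addr_md5)
) ENGINE=InnoDB;

LOAD DATA LOCAL INFILE '/data/router_addrs_sorted.tsv'
INTO TABLE router_addrs
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
(@md5_hex, addr)
SET addr_md5 = UNHEX(@md5_hex);
```

The external sort does the deduplication sequentially, which is the whole point: the "does this address already exist?" check never turns into 150 billion random BTree probes.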
Result: The load is reasonably fast (but complex). The router address is not 75 bytes (nor 5 bytes), but 16. The normalization table exists and works.