I'm working on a project that requires me to generate billions of unique codes. Currently I use MariaDB with the InnoDB engine and Python to generate random unique codes, inserting batches of 5000 unique codes per generation cycle.
My table structure:

```
row_id int              -- primary key + autoincrement
unique_code varchar(10) -- unique
```
The problem: inserts get really slow when I hit 500,000,000-ish unique codes, and I still need to generate up to 3 billion codes. Before hitting that many records, I could insert 300-400 million unique codes in just a few hours.
Any help will be appreciated, thanks!
Update (22 Jan '19): Answering Rick James' solution. Here are some samples of the generated codes:
RLXT$CPS1Y
Y4P$9K70WO
PKSTY9M$FR
T$0VEFL2B1
RX4$MEKVQL
My server has 32GB of RAM and a relatively fast SAS hard disk; I think that's more than enough for my needs (or is it?).
In my experience, TokuDB had a slower insert rate and struggled before hitting 100m records, so I went with InnoDB at the time.
As for the transactions I mentioned before: yes, 5000 records are inserted in a single transaction. It was very fast until 150m codes; after that I noticed the speed dropping gradually as the record count grew. Now I'm at 800m codes, and an insert cycle (5000 recs) takes 10 to 15 seconds.
I was using the autoincrement id for sorting and marking the records, because those codes will be transferred to another database for printing (production). So I need to know which codes have been transferred and which haven't.
I will wait for further answers; in the meantime I will try Rick's suggestions. Thanks!
Show us a sample of, say, the first 10 values.
Here's why you probably "hit the wall"... Indexes can be categorized (at a high level) into two flavors:
- **Continuous**, such as `AUTO_INCREMENT` values or `TIMESTAMP`s, where you insert the rows chronologically, or even approximately chronologically. These values are inserted at the "end" of the table or index and hit only the last block (or few blocks) of the BTree. With all the activity confined to a few blocks, there is little I/O to perform.
- **Random**, such as UUIDs, MD5s, and other "random" values, presumably including yours. In this case, the 'next' value to insert into the table/index is not likely to still be cached in RAM, so I/O is needed. While the table is not too big, all the index blocks can be held in RAM and little I/O is needed. But after the index grows bigger than the cache, adding the 'next' value will more and more often require I/O. Your process will become slower and slower.
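The two flavors above can be illustrated with a toy simulation (this is not InnoDB itself, just a fixed-size LRU standing in for the buffer pool, with made-up block sizes): once the index outgrows the cache, random-key inserts miss far more often than sequential ones.

```python
# Toy model: an index of "blocks" cached in a fixed-size LRU.
# Sequential keys keep hitting the last block; random keys scatter
# across all blocks and thrash the cache once the index outgrows it.
import random
from collections import OrderedDict

def hit_rate(keys, cache_blocks=1000, keys_per_block=100):
    cache = OrderedDict()              # block_id -> None, in LRU order
    hits = 0
    total = 0
    for k in keys:
        total += 1
        block = k // keys_per_block    # which index block this key lands in
        if block in cache:
            hits += 1
            cache.move_to_end(block)   # mark as most recently used
        else:
            cache[block] = None
            if len(cache) > cache_blocks:
                cache.popitem(last=False)   # evict least recently used
    return hits / total

n = 1_000_000
sequential = range(n)                                   # AUTO_INCREMENT-style
random_keys = [random.randrange(n) for _ in range(n)]   # UUID-style

print(f"sequential hit rate: {hit_rate(sequential):.2%}")   # near 99%
print(f"random hit rate:     {hit_rate(random_keys):.2%}")  # near 10%
```

The random case settles at roughly cache_blocks / total_blocks, which is why insert speed degrades gradually as the index grows rather than falling off a cliff all at once.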
What to do?
Plan A: Add the 'random' index after inserting all the rows. Adding the index will be very slow, but probably faster in the long run because it can use a different algorithm.
Plan B: Don't create all the values up front. Instead, create the next one when you need it.
Plan C: Buy enough RAM to hold the 'random' index entirely in RAM. (Plan to have about 2x the index size.)
Plan D: You tried TokuDB? I would expect it to survive longer before getting into serious trouble. What was your experience?
You mentioned transactions. Please elaborate. Did you mean each 5000 codes was `INSERT`ed in a transaction? That is probably optimal.
What charset and collation are you using for your unique code? You should probably use `ascii` and `ascii_bin` -- for speed and to avoid case-folding issues.
And... Here's another thought on how to generate them. There will be no need to check for uniqueness as you go, since they will be generated unique:
Think of your 10-character strings as integers encoded in base 95 (or however many distinct characters you are allowing). We'll generate the numbers sequentially, convert them to strings, then randomize them.
The 'next' value is computed as a random value past the 'current' value. The random value needs to be between 1 and some increment that might be about a billion (it depends on how many numbers you ultimately want, the charset, etc.)
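A minimal sketch of that scheme in Python. The alphabet here is a guess based on your samples (A-Z, 0-9, and `$` -- 37 symbols, so base 37 rather than 95), and the stride cap is sized so that 3 billion codes still fit inside 37^10 possible values; adjust both to your real charset and target count. Because each number is strictly greater than the last, the codes are unique by construction and no dup-check is needed:

```python
# Sequential numbers with a random stride, encoded as fixed-width strings.
# CHARSET is assumed from the question's samples; change as needed.
import random

CHARSET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789$"   # assumed 37-symbol alphabet
BASE = len(CHARSET)
WIDTH = 10                                          # 10-character codes

def encode(n: int) -> str:
    """Convert an integer to a fixed-width base-37 string."""
    digits = []
    for _ in range(WIDTH):
        n, r = divmod(n, BASE)
        digits.append(CHARSET[r])
    return "".join(reversed(digits))

def generate(count: int, max_stride: int = 1_000_000):
    """Yield `count` unique codes; 'next' = 'current' + random stride.
    With BASE**WIDTH ~= 4.8e15 values, a stride cap of ~1e6 leaves
    comfortable room for 3 billion codes."""
    current = 0
    for _ in range(count):
        current += random.randint(1, max_stride)    # always moves forward
        yield encode(current)

batch = list(generate(5000))        # one insert batch
assert len(set(batch)) == 5000      # unique by construction
```

Note that codes produced this way come out in increasing order, which is exactly why the `ORDER BY RAND()` shuffle below is still needed before handing them out.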
`INSERT` batches of 5K (or whatever) into a MyISAM table that has no indexes.
When finished, do this:

```sql
CREATE TABLE `real` (                     -- REAL is a reserved word, hence the backticks
    id ... AUTO_INCREMENT,                -- do you really need this??
    random CHAR(10) NOT NULL CHARSET ascii COLLATE ascii_bin,
    PRIMARY KEY(id),                      -- what for?
    INDEX(random)                         -- uniqueness has been checked
);

INSERT INTO `real` (random)
    SELECT random FROM myisam_table
    ORDER BY RAND();
```
Here's how that will perform: it `INSERT`s the rows into the `real` table in random order, creating sequential `id`s as it goes. Note: this will create a huge undo log, so be sure to have a lot of disk space.
As for my comments about leaving off `id`, `UNIQUE`, etc., please provide information on how you intend to use `real`, so I can agree with, or argue against, their need.
Another Plan
Don't pre-generate the values. Instead, generate a new value from the approx 14T possible values, check for dups, and generate another if necessary. In this Plan, the table grows gradually as needed rather than having to be built, with a struggle, up front. Instead, a little effort (milliseconds) is expended whenever a new value is needed. This can be wrapped in a Stored Function to make it easy for the user.
The table would have just one column: `unique_code CHAR(10) CHARSET ascii PRIMARY KEY`.
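A sketch of that generate-on-demand Plan, using a Python set as a stand-in for the single-column table (in production the membership check would be a `SELECT` against the `PRIMARY KEY`, or the Stored Function equivalent). With the 37-symbol alphabet assumed earlier there are about 4.8e15 possible codes, so even at 3 billion existing rows a collision and retry is roughly a one-in-a-million event:

```python
# Generate-check-retry: draw a random code, retry on the (rare) collision.
# `existing` stands in for the one-column table; the assumed alphabet
# matches the question's samples.
import random

CHARSET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789$"   # assumed alphabet

existing = set()    # stand-in for: SELECT 1 FROM t WHERE unique_code = ?

def next_code(width: int = 10) -> str:
    """Return a fresh unique code, retrying until no duplicate is found."""
    while True:
        code = "".join(random.choice(CHARSET) for _ in range(width))
        if code not in existing:
            existing.add(code)          # stand-in for the INSERT
            return code

batch = [next_code() for _ in range(5000)]
assert len(set(batch)) == 5000
```

The nice property of this Plan is that the cost is spread out: each call pays a few milliseconds at most, instead of the bulk build fighting the buffer pool for days.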