I need to store entries in the range of billions, so row real estate is pretty precious here. There is a short string for every entry with the following spec:
I'd like to do something like CHAR(50) in InnoDB with utf8mb4 as charset, except that the 50 means byte-length not character-length. Is this possible? I'd prefer the data remain legible, but this is not a requirement.
No.
Use VARCHAR(20)
. That allows up to 20 utf8mb4 characters. I suggest you use a number a little larger than 20, just in case the specifications change later. Your first two bullet items would take "up to" 21 and 41 bytes, respectively. The "1" is for the "length".
Back to CHAR
. It is "fixed" length. And it refers to characters. That is CHAR(20)
utf8mb4 is always 80 bytes. Well, not always. InnoDB silently changes it to a hybrid between fixed- and variable-length. Let's not go into that mess.
It is good to try to conserve space. Be sure to do likewise with the various INTs
.
Alas, InnoDB takes 2x-3x the space that you might predict. This overhead is necessary for (1) handling ACID, and (2) speed efficiency.
More
"Word" alignment is not useful -- The code is too generic to take any advantage of such. [VAR]BINARY counts bytes; it does not character set checks, so it is "faster". The "VAR" takes one byte, but saves in that the string, itself, may be shorter than the max.
For VAR
, there is no difference between (20), (40), (50), etc, with two exceptions. The max is checked when inserting, and beyond some point, there needs to be 2 bytes for the "length".
Show us the variety of data that you will be storing as strings. We may be able to advise in more detail. For example, the accented letters of western Europe take 1 byte with character set latin1, but 2 bytes in utf8. VARBINARY would blindly take whatever the client gives it -- with no understanding or conversion of the encoding.
Since you mentioned "4-byte", I infer that you must use VARCHAR(...), CHARACTER SET utf8mb4
and the max needs to the max number of characters, not bytes. Switching to VARBINARY(...)
gives you no change in the space required. However, the max needs to be in bytes. Based on your specs, it sounds like either of these would suffice and take an identical amount of disk space:
VARCHAR(20) -- but make that a little bigger, just in case
VARBINARY(50) -- ditto
Have you also checked all the numeric columns? Many people blindly use 4-byte INT or 8-byte BIGINT when they don't need such a big range. Similarly for FLOAT/DOUBLE and DECIMAL.