Search code examples
sortingbinarymariadbcollationcharacter-set

Binary data type collation with Characters in MariaDB


I am trying to get a deep understanding about RDBMS, I am learning MariaDB.

Struggling to get the Binary data type and how its collation works.

I understand the binary(n) data type takes n bytes. This means the smallest memory that could be allocated is 1 byte?

This is taken from the MariaDB documentation:

The BINARY type is similar to the CHAR type, but stores binary byte strings rather than non-binary character strings. M represents the column length in bytes. It contains no character set, and comparison and sorting are based on the numeric value of the bytes.

I have the following sub questions:

  1. What exactly is a numeric value of a byte ? Is it the same as an ASCII character getting a number? If not, how is a numeric value determined ?
  2. If I insert a character 'A' inside a binary data type, how is this encoded and stored ? In the Table plus client, I can't see this being stored in binary.
  3. In the same data type i could add chars and also a binary image for example, how is the numeric value of the byte string determined?

I have created a few tables to test this,

First I create a table to test this with Binary(1) this doesn't allow me to add more than 1 digit 0-9 or 1 character A-Z or a-z

And then I have created a table with Binary(10) to see how the sorting order works

CREATE TABLE BinaryEg1(
    b BINARY(1)
);

-- Adding two binary digits
INSERT INTO BinaryEg1 VALUES (0), (1);


-- Throws error, too long
INSERT INTO BinaryEg1 VALUES (10);


-- Adding basic chars to compare sort
INSERT INTO BinaryEg1 VALUES ('A'), ('a'), ('B'), ('b');

Here if I see the default sort order it places numbers before capital letters and then small case letters:

SELECT * FROM BinaryEg1 ORDER BY b;

Binary(1) column sort

-- Creating a table of 10 binary bytes to test how sorting works
CREATE TABLE BinaryEg10(
    b BINARY(10)
);

-- Inserting values into 10 byte binary column
INSERT INTO BinaryEg10 VALUES (HEX('A'));

-- Adding some numbers of varying lengths to test sort order
INSERT INTO BinaryEg10 VALUES
(110),
(44),
(999999),
(1111111111),
(7876);

SELECT * FROM BinaryEg10 ORDER BY b;

Now, I see the sort order:

Binary(10) sort

Please can someone explain how the sorting works, it seems like the first byte is the most important in the sequence, it ignores the rest of the bytes,

I understand, this might be an irrelevant and pointless question as normally i assume i should not use binary for storing chars, but i would still like to get a solid understanding about how this data type works


Solution

  • Thanks to the comment above in my question it gave me the idea of getting the hex value. Now, I have a deep understanding about how this works.

    The answers:

    What exactly is a numeric value of a byte ? Is it the same as an ASCII character getting a number? If not, how is a numeric value determined ?

    Each byte in the byte sequence, each byte is a number from 0-255 base 10, because each byte has 8 bits.

    When I add an ASCII character, it's numeric value maps to its hex numeric code, which maps to its decimal numeric code. So yes in the case of ASCII chars, the numeric code & numeric values are identical.

    If I insert a character 'A' inside a binary data type, how is this encoded and stored ? In the Table plus client, I can't see this being stored in binary.

    If I insert the character 'A' in a binary column, it will take 1 byte of memory, the same way it would if I added the char 'A' in a char column.

    It will be encoded into binary based on its Decimal/hex value from the Latin1 charset which is the default charset for MariaDB. This maps to ASCII and Unicode tables, so its the same.

    The char 'A' has a hex value of: 41. This when converted to binary will be stored as: 0100 0001 and that takes a byte of memory. The decimal value is 65. If comparing this char to another value in another row, the comparison will be based on this numeric value and not a collation as it would be with chars.

    In the same data type i could add chars and also a binary image for example, how is the numeric value of the byte string determined?

    It's important to understand and conceptualise the binary data type as a SEQUENCE OF BYTES. And each byte has a numeric value from 0 to 255.

    The comparison compares the numeric value of each byte working from left to right. If the bytes are equal, then it compares the byte value from the next set of bytes and so on.

    Here are some examples that show this:

    -- Creating a table with just 1 byte
    CREATE TABLE BinaryEg1(
        b BINARY
    );
    
    -- Inserting values:
    INSERT INTO BinaryEg1 VALUES (0), (1), ('A'), ('B'), ('a'), ('b'), (0);
    

    Now, I have created two columns, one with the normal value and the other with the hex value and we can see the numeric values add up to the numeric codes as found in ASCII, Latin1, UTF8 character sets:

    SELECT b, HEX(b) FROM BinaryEg1 ORDER BY b;
    

    Binary column with hex value sorted:

    I am creating another example with three columns to show the same point and we can see how columns are sorted in the same way:

    CREATE TABLE BinaryEg3(
        b BINARY(3)
    );
    
    INSERT INTO BinaryEg3 VALUES 
    (418), (518), (318);
    (320), (319), (317);
    

    Now, you can see the normal decimal number characters as well as their hex values and I can get an idea of how it would be sorted:

    SELECT b, HEX(b) FROM BinaryEg3 ORDER BY b;
    

    Hex values for sequence of 3 bytes

    Now, these values are stored in binary based on those hex numbers. Each byte has 2 hex digits, hence the 3 digits have 6 hex digits.

    Each byte is compared based on that numeric value, which is why 317 comes before 318 and 318 before 319.

    Because the hex/binary value of 3 and 1 is the same, but 7 < 8 < 9