I would like to know what exactly SQL is comparing when we use the ORDER BY statement. More specifically, I'm interested in the comparison when it compares string. Supposedly it is sorting it alphabetically, but what is it actually comparing?
My hunch tells me it could be comparing ASCII values of the characters starting from the left, which would also imply that the sorting is case sensitive ('Btest' would be smaller than 'atest'), but I am unable to find a source confirming this.
In MySQL, it depends on the effective collation. Collation is the set of rules that determine the position of characters in an ordered set and what characters are considered equal, and typically involve natural language rules. For example, Spanish used to have ch
as an independent letter located betwen c
and d
and then switched to being just individual c
and h
; MySQL has collations for both.
You can see available collations with these commands:
SHOW COLLATION; -- Display all
SHOW COLLATION WHERE charset = 'utf8mb4'; -- Filter by encoding
Collation | Charset | Id | Default | Compiled | Sortlen | Pad_attribute |
---|---|---|---|---|---|---|
utf8mb4_0900_ai_ci | utf8mb4 | 255 | Yes | Yes | 0 | NO PAD |
utf8mb4_0900_as_ci | utf8mb4 | 305 | Yes | 0 | NO PAD | |
utf8mb4_0900_as_cs | utf8mb4 | 278 | Yes | 0 | NO PAD | |
utf8mb4_0900_bin | utf8mb4 | 309 | Yes | 1 | NO PAD | |
utf8mb4_bin | utf8mb4 | 46 | Yes | 1 | PAD SPACE | |
utf8mb4_croatian_ci | utf8mb4 | 245 | Yes | 8 | PAD SPACE | |
utf8mb4_cs_0900_ai_ci | utf8mb4 | 266 | Yes | 0 | NO PAD | |
utf8mb4_cs_0900_as_cs | utf8mb4 | 289 | Yes | 0 | NO PAD | |
utf8mb4_czech_ci | utf8mb4 | 234 | Yes | 8 | PAD SPACE | |
utf8mb4_danish_ci | utf8mb4 | 235 | Yes | 8 | PAD SPACE |
[...]
Collation names in MySQL use some common substrings to indicate certain features:
ci
/ cs
for Case Insensitive / Case Sensitiveai
/ as
for Accent Sensitive / Accent Insensitive... and some others (full list here).
In MySQL, you can set collation at several levels:
So you always get one, either explicit or implicit.