Tags: php, mysql, laravel, encoding, utf-8

Encoding in Laravel 9.x — Validation fails due to incorrectly matching special characters


I previously asked a question about an encoding issue I was having in Laravel. Namely, my database query for ʔamal was returning ʔāmāl as a valid result — that is, the distinct characters a & ā were being interpreted as equivalent. It seems to have been an issue with the database table encoding. I was advised to change it to utf8mb4_unicode_ci & that seemed to work back then.

However, I'm still having a very similar issue — this time with validation. I'm creating a dictionary, so naturally the slug for each term's page should be unique. But now that I've created the word mīn the application isn't letting me create the word min (i.e. it does not consider .../mīn & .../min distinct). I've also just noticed for the first time that if, for example, I navigate to .../wēn & manually change the address to .../wen (a non-existent word), it still displays the page for the term wēn.

Before even discussing encoding, is this somehow just a matter of browser behavior & so on? I mean, is there even a way to require my application to consider .../mīn & .../min distinct? If that's the case, could someone clarify how to do that?


Solution

  • ā = a in virtually all utf8 and utf8mb4 collations, including the really old utf8_general_ci.

    So unless you use one of the following, MySQL will treat a-macron as equal to plain "a" (and to most other accented forms of "a"):

    utf8_bin
    utf8mb4_bin
    utf8mb4_0900_as_ci   -- and any other %_as_% collation
    

    _bin means the strings are compared by the binary values of their characters, so any difference counts;
    _as_ means "accent sensitive".
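
    As a rough sketch of how one might check and change this for the slug column: the table name `terms` and the column definition `VARCHAR(255) NOT NULL` are assumptions made for illustration (the question does not show the schema), so adjust them to the real table. `utf8mb4_bin` is used here because the `utf8mb4_0900_*` collations were only introduced in MySQL 8.0.

    ```sql
    -- List the accent-sensitive utf8mb4 collations this server offers.
    SHOW COLLATION LIKE 'utf8mb4%\_as\_%';

    -- See which collation each column of the (hypothetical) terms table uses now.
    SELECT COLUMN_NAME, COLLATION_NAME
    FROM information_schema.COLUMNS
    WHERE TABLE_SCHEMA = DATABASE()
      AND TABLE_NAME = 'terms';

    -- Change only the slug column. MODIFY replaces the whole column definition,
    -- so the type and NOT NULL must be restated (assumed values shown here).
    ALTER TABLE terms
      MODIFY slug VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL;
    ```

    Keep in mind that `utf8mb4_bin` is also case sensitive, so `Min` and `min` would become distinct slugs as well. If only accents should matter, an `_as_ci` collation is the closer fit where it is available.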

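    Application code and the browser may apply other rules of their own. Here, though, the collation is the likely cause: Laravel's unique validation rule and a lookup of a term by its slug both run as database queries, so the column's collation decides whether a = ā.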

    More

    To test a particular collation for a particular comparison:

    SELECT 'a' = 'ā' COLLATE utf8mb4_0900_as_ci;
    

    A result of 1 means the two strings are treated as equal; 0 means they are not. Note: this assumes that you are otherwise using CHARACTER SET utf8mb4. The WEIGHT_STRING() function can also be used to inspect the weights a collation assigns to a string.
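
    A small sketch of both checks side by side. The `_utf8mb4` introducer is an addition here, intended to keep the COLLATE clauses valid even if the connection's default character set is something else. The column aliases are arbitrary.

    ```sql
    -- Compare the same pair under an accent-insensitive and a binary collation.
    SELECT
      _utf8mb4'a' = _utf8mb4'ā' COLLATE utf8mb4_unicode_ci AS with_unicode_ci,
      _utf8mb4'a' = _utf8mb4'ā' COLLATE utf8mb4_bin        AS with_bin;

    -- Inspect the sort weights directly: if the two hex strings are identical,
    -- the collation treats the two characters as equal.
    SELECT
      HEX(WEIGHT_STRING(_utf8mb4'a' COLLATE utf8mb4_unicode_ci)) AS weight_a,
      HEX(WEIGHT_STRING(_utf8mb4'ā' COLLATE utf8mb4_unicode_ci)) AS weight_a_macron;
    ```

    If the description above holds, the first query should return 1 for `with_unicode_ci` and 0 for `with_bin`.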