Search code examples
mysqlindexingsql-likecultureinfoutf8mb4

Get exact filter results with charset utf8mb4


I want to match filter results exactly using CHARACTER SET utf8mb4.

ALTER DATABASE test CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;

CREATE TABLE users (    
username VARCHAR(25) NOT NULL,    
password VARCHAR(25) NULL NULL
) 
ENGINE=innodb DEFAULT CHARSET=utf8mb4 
COLLATE=utf8mb4_general_ci ROW_FORMAT=DYNAMIC;
CREATE  INDEX  users_username_idx ON users (username); 

I inserted 1 million rows also containing below rows.

INSERT INTO users SELECT 'üsasdasd','somepassword';
INSERT INTO users SELECT 'usasdasd','somepassword';
INSERT INTO users SELECT 'pğasdasdasd','somepassword';
INSERT INTO users SELECT 'anfüs','somepassword';
INSERT INTO users SELECT 'anfus','somepassword';

Then applied some queries.

SELECT * FROM users WHERE username LIKE 'üs%';
SELECT * FROM users WHERE username LIKE 'us%';
SELECT * FROM users WHERE username LIKE 'pğ%';
SELECT * FROM users WHERE username = 'anfüs';
SELECT * FROM users WHERE username = 'anfus';

Why do i get results that ALSO have ü letter(like üsasdasd or anfüs) when i filter as "LIKE 'us%'" or " = 'anfus'"
OR Why do i get results that ALSO have u letter(like usasdasd or anfus) when i filter as "LIKE 'üs%'" or " = 'anfüs'" ?

How can i get results that ONLY have ü letter(like üsasdasd or anfüs) when i filter as "LIKE 'üs%'" or " = 'anfüs'" vice versa ?

ANSWER:

I dont know why but, COLLATE=utf8mb4_0900_as_cs fixed the issue.


Solution

  • For passwords, you want to ignore all case folding, accent stripping, etc, so use COLLATE utf8mb4_bin.

    I suspect utf8mb4_0900_as_cs is not quite the same as _bin.

    On the other hand, do you really want to store "plaintext" passwords? A common technique is to store a hash of the pwd and do the hash when testing. A simple hash that is secure from casual hackers:

    MD5(CONCAT(password, 'my secret salt'))