php mysql regex utf-8 sql-injection

What does this preg_replace do? (/[\xF0-\xF7].../)


Obviously $data is the string and we are removing the characters that match the regular expression, but which characters does /[\xF0-\xF7].../ specify?

 preg_replace('/[\xF0-\xF7].../', '', $data)

Also, what is the significance of these characters being replaced?

Edit for bounty: specifically, what exploit is this trying to prevent? The data is later used in MySQL queries (non-PDO), so I presume these characters are involved in some kind of injection attack. Or not? I am trying to understand the logic behind this line of code in a script I am reading.


Solution

  • It removes 4-byte sequences from a UTF-8 string. In such a sequence the first byte is always in the range [\xF0-\xF7], and the three dots match the remaining three bytes.
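To illustrate the byte-level matching, here is a minimal sketch. The sample strings are assumptions chosen for the demonstration (U+1F600 written out as its four UTF-8 bytes, and the euro sign as its three bytes); only the pattern comes from the question.

```php
<?php
// Hypothetical input: "a", then U+1F600 as its four UTF-8 bytes, then "b".
$data = "a\xF0\x9F\x98\x80b";

// Without the /u modifier the pattern works on raw bytes: one lead byte
// in F0-F7 followed by any three bytes (a dot does not match a newline).
$clean = preg_replace('/[\xF0-\xF7].../', '', $data);

var_dump(strlen($data)); // int(6): 1 + 4 + 1 bytes
var_dump($clean);        // string(2) "ab"

// A 3-byte character such as the euro sign (E2 82 AC) is left alone,
// because none of its bytes fall in the F0-F7 range.
$euro = "\xE2\x82\xAC";
var_dump(preg_replace('/[\xF0-\xF7].../', '', $euro) === $euro); // bool(true)
```

Because each dot matches any byte except a newline, the pattern does not check that the three following bytes are valid continuation bytes, so it should be read as a blunt filter rather than a UTF-8 validator.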

    According to http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html:

    The character set named utf8 uses a maximum of three bytes per character and contains only BMP characters.

    MySQL with the utf8 character set selected may truncate text at the point where such a sequence appears, and if the SQL mode does not include STRICT_TRANS_TABLES it may do so silently instead of raising an error such as SQLSTATE[HY000]: General error: 1366 Incorrect string value.
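As an alternative to stripping, an application could detect such characters before the data reaches the database. The following is a sketch only: the helper name hasSupplementaryChars is hypothetical, and it assumes the input is meant to be UTF-8.

```php
<?php
// Hypothetical helper: true if $s contains a code point outside the BMP,
// i.e. one that needs four bytes in UTF-8.
function hasSupplementaryChars(string $s): bool
{
    // With the /u modifier, \x{...} refers to code points, not bytes.
    // preg_match() returns false on invalid UTF-8; that case is treated
    // as "no match" here, which is an assumption of this sketch.
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $s) === 1;
}

var_dump(hasSupplementaryChars("a\xF0\x9F\x98\x80b")); // bool(true)
var_dump(hasSupplementaryChars("caf\xC3\xA9"));        // bool(false)
```

With a check like this the application can reject or log the input instead of letting the database decide what to keep.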

    Such truncation can potentially be exploited.

    For example, suppose a website has a user named admin and allows anyone to register. Using a string that gets truncated, an attacker will probably be able to insert a second admin row with a different email address, bypassing the uniqueness check. The attacker then suspends that account and tries the account restore procedure. It will issue a query like SELECT * FROM users WHERE name = 'admin', and since the original admin is the first record, the attacker ends up restoring the original admin's password.
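The mismatch behind this scenario can be simulated in plain PHP. This is a rough sketch, not MySQL's actual code path: simulateUtf8Truncation is a hypothetical stand-in for the lossy storage described above, and the submitted name is an invented example.

```php
<?php
// Hypothetical stand-in for lossy storage: drop everything from the
// first 4-byte lead byte (F0-F7) to the end of the string.
function simulateUtf8Truncation(string $s): string
{
    // The /s modifier lets the dot match newlines too, so the whole
    // tail of the string is removed.
    return preg_replace('/[\xF0-\xF7].*/s', '', $s);
}

// Invented registration name: "admin" followed by U+1F600 and more text.
$submitted = "admin\xF0\x9F\x98\x80anything";
$stored    = simulateUtf8Truncation($submitted);

// An application-level uniqueness check compares the submitted value...
var_dump($submitted === 'admin'); // bool(false)
// ...but the value that would end up stored collides with the real account.
var_dump($stored === 'admin');    // bool(true)
```

Stripping the 4-byte sequences up front, as the preg_replace in the question does, means the value the application checks and the value the database stores are the same string.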