I am sanitizing a contact form string:
$note = filter_var($_POST["note"], FILTER_SANITIZE_STRING);
This works great except when people write in inches (") and feet ('). So I'm interested in 5" 8" 10" & 1'
comes up as I&#39;m interested in 5&#34; 8&#34; 10&#34; & 1&#39;
Which is a bit of a garbled mess.
Can I sanitize yet keep my I'm 5'9"?
Computer data itself is neither harmful nor innocuous. It's just a piece of information that can later be used for a given purpose.
Sometimes, data is used as computer source code, and such code eventually leads to physical actions (a disk spins, an LED blinks, a picture is uploaded to a remote computer, a thermostat turns off the boiler...). It's then, and only then, that data can become harmful; we even lose expensive spaceships now and then because of software bugs.
Code you write yourself can be as harmful or innocuous as your abilities or good faith dictate. The big problem comes when your application has a vulnerability that allows execution of untrusted third-party code. This is particularly serious in web applications, which are connected to the open internet and are expected to receive data from anywhere in the world. But how is that physically possible? There are several ways, but the most typical case is dynamically generated code, which happens all the time on the modern web. You use PHP to generate SQL, HTML, JavaScript... If you pick untrusted arbitrary data (e.g. a URL parameter or a form field) and use it to compose code that will later be executed (either by your server or by the visitor's browser), someone can be hacked (either you or your users).
You'll see that every day here at Stack Overflow:
$username = $_POST["username"];
$row = mysql_fetch_array(mysql_query("select * from users where username='$username'"));
<td><?php echo $row["title"]; ?></td>
var id = "<?php echo $_GET["id"]; ?>";
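To see how the first snippet above turns data into code, here is a minimal sketch that only builds the query string and prints it, without touching a database. The attacker-supplied value is a hypothetical example, not something from the snippets above.

```php
<?php
// Hypothetical value standing in for $_POST["username"]; chosen for illustration only.
$username = "' OR '1'='1";

// Same string interpolation as in the snippet above, but the query is not executed.
$sql = "select * from users where username='$username'";

// The quote inside the data closes the SQL string literal early,
// so the rest of the input is read by the database as SQL code.
echo $sql, PHP_EOL;
```

The printed statement now carries an extra OR condition that the developer never wrote: the input stopped being data and became part of the code.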
Faced with this problem, some claim: let's sanitize! It's obvious that some characters are evil so we'll remove them all and we're done, right? And then we see stuff like this:
$username = $_POST["username"];
$username = strip_tags($username);
$username = htmlentities($username);
$username = stripslashes($username);
$row = mysql_fetch_array(mysql_query("select * from users where username='$username'"));
This is a surprisingly widespread misconception, adopted even by some professionals. You see the symptoms everywhere: your comment is mutilated at the first < symbol, you get "your password cannot contain spaces" on sign-up, and you read "Why can't I use certain words like "drop" as part of my Security Question answers?" in the FAQ. It's even inside computer languages: whenever you read "sanitize", "escape"... in a function name (without further context), you have a good hint that it might be a misguided effort.
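The "mutilated comment" symptom is easy to reproduce. The following is a small sketch, assuming a hypothetical comment text, of what blanket tag stripping does to perfectly harmless input.

```php
<?php
// Hypothetical user comment; nothing in it is an HTML tag.
$comment = 'I <3 PHP';

// strip_tags() does not validate HTML: a "<" followed by a non-space
// character is treated as the start of a tag, and everything up to the
// next ">" (or the end of the string) is discarded.
$clean = strip_tags($comment);

var_dump($comment);
var_dump($clean);
```

The second dump shows that the text after the < symbol is gone, even though the visitor never tried to inject anything.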
It's all about establishing a clear separation between data and code: the user provides data but only you provide code. And there isn't a universal one-size-fits-all solution because each computer language has its own syntax and rules. DROP TABLE users; can be terribly dangerous in SQL:
mysql> DROP TABLE users;
Query OK, 56020 rows affected (0.52 sec)
(oops!)... but it's not as bad in e.g. JavaScript. Look, it doesn't even run:
C:\>node
> DROP TABLE users;
SyntaxError: Unexpected identifier
at Object.exports.createScript (vm.js:24:10)
at REPLServer.defaultEval (repl.js:235:25)
at bound (domain.js:287:14)
at REPLServer.runBound [as eval] (domain.js:300:12)
at REPLServer.<anonymous> (repl.js:427:12)
at emitOne (events.js:95:20)
at REPLServer.emit (events.js:182:7)
at REPLServer.Interface._onLine (readline.js:211:10)
at REPLServer.Interface._line (readline.js:550:8)
at REPLServer.Interface._ttyWrite (readline.js:827:14)
>
This last example also illustrates that it's not only a security concern. Even if you're not being hacked, generating code from random input can simply make your app crash:
SELECT * FROM customers WHERE last_name='O'Brian';
You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'Brian''
So what should be done if there isn't a universal solution?
Understand the problem:
If you inject raw literal data improperly it can become code (and sometimes invalid code).
Use the specific mechanism for each technology:
If the target language requires escaping:
<p><3 to code</p>
→ <p>&lt;3 to code</p>
... find the specific tool that does this escaping in the source language:
echo '<p>' . htmlspecialchars($motto) . '</p>';
If the language/framework/technology allows sending data through a separate channel, do it:
$sql = 'SELECT password_hash FROM user WHERE username=:username';
$params = array(
'username' => $username,
);
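Applied to the contact form note from the question, the two mechanisms above could be combined roughly as follows. This is only a sketch under stated assumptions: it uses PDO with an in-memory SQLite database (so the pdo_sqlite extension must be available), and the contact table and the sample note are made up for illustration.

```php
<?php
// Hypothetical form value standing in for $_POST["note"]; kept exactly as typed.
$note = "I'm 5'9\" & interested in 5\" 8\" 10\"";

// Assumption: pdo_sqlite is installed; the table exists only for this sketch.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE contact (note TEXT)');

// Separate channel: the SQL text holds only a placeholder,
// and the data is sent apart from it, so quotes need no special handling.
$insert = $pdo->prepare('INSERT INTO contact (note) VALUES (:note)');
$insert->execute(array('note' => $note));

// The database returns the note unchanged.
$stored = $pdo->query('SELECT note FROM contact')->fetchColumn();

// Escaping is applied only at the moment the value is placed into HTML.
$html = '<p>' . htmlspecialchars($stored, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8') . '</p>';
echo $html, PHP_EOL;
```

The stored value keeps its feet and inches intact, and the entities appear only in the HTML source, where the browser turns them back into the original characters for the reader.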