What is the best way to safely read user input?

Let's consider a REST endpoint which receives a JSON object. One of the JSON fields is a String, so I want to validate that no malicious text is received.

@ValidateRequest
public interface RestService {
  @POST
  @Consumes(APPLICATION_JSON)
  @Path("endpoint")
  void postData (@Valid @NotNull Data data);
}

public class Data {
  @ValidString
  private String s;
  // get,set methods 
}

I'm using the bean validation framework via @ValidString to delegate the validation to the ESAPI library.

@Override
public boolean isValid (String value, ConstraintValidatorContext context) {
  return ESAPI.validator().isValidInput(
      "String validation",
      value,
      this.constraint.type(),
      this.constraint.maxLength(),
      this.constraint.nullable(),
      true);
}

This method canonicalizes the value (i.e. removes encryption) and then validates against a regular expression provided in the ESAPI config. The regex is not that important to the question, but it mostly whitelists 'safe' characters.

All good so far. However, in a few occasions, I need to accept 'less' safe characters like %, ", <, >, etc. because the incoming text is from an end user's free text input field.

Is there a known pattern for this kind of String sanitization? What kind of text can cause server-side problems if SQL queries are considered safe (e.g. using bind variables)? What if the user wants to store <script>alert("Hello")</script> as his description which at some point will be send back to the client? Do I store that in the DB? Is that a client-side concern?

Solution

When dealing with text coming from the user, best practice is to white list only known character sets as you stated. But that is not the whole solution, since there are times when that will not work, again as you pointed out sometimes "dangerous" characters are part of the valid character set.

When this happens you need to be very vigilant in how you handle the data. I, as well as the commenters, recommended is to keep the original data from the user in its original state as long as possible. Dealing with the data securely will be to use proper functions for the target domain/output.

SQL

When putting free format strings into a SQL database, best practice is to use prepared statements (in java this is the PreparedStatement object or using ORM that will automatically parameterizes the data.

To read more on SQL injection attacks and other forms of Injection attacks (XML, LDAP, etc.) I recommended OWASPS Top 10 - A1 Injections

XSS

You also mentioned what to do when outputting this data to client. In this case I you want to make sure you html encode the output for the proper context, aka contextual output encoding. ESAPI has Encoder Class/Interface for this. The important thing to note is which context (HTML Body, HTML Attribute, JavaScript, URL, etc.) will the data be outputted. Each area is going to be encoded differently.

Take for example the input: <script>alert('Hello World');<script>

Sample Encoding Outputs:

HTML: <script>alert('Hello World');<script>
JavaScript: \u003cscript\u003ealert(\u0027Hello World\u0027);\u003cscript\u003e URL: %3Cscript%3Ealert%28%27Hello%20World%27%29%3B%3Cscript%3E
Form URL: %3Cscript%3Ealert%28%27Hello+World%27%29%3B%3Cscript%3E
CSS: \00003Cscript\00003Ealert\000028\000027Hello\000020World\000027\000029\00003B\00003Cscript\00003E
XML: <script>alert('Hello World');<script>

For more reading on XSS look at OWASP Top 10 - A3 Cross-Site Scripting (XSS)