Search code examples
pythonsecurityuser-inputstring-formattingstring-interpolation

Which are safe methods and practices for string formatting with user input in Python 3?


My Understanding

From various sources, I have come to the understanding that there are four main techniques of string formatting/interpolation in Python 3 (3.6+ for f-strings):

  1. Formatting with %, which is similar to C's printf
  2. The str.format() method
  3. Formatted string literals/f-strings
  4. Template strings from the standard library string module

My knowledge of usage mainly comes from Python String Formatting Best Practices (source A):

  • str.format() was created as a better alternative to the %-style, so the latter is now obsolete
  • f-strings allow str.format()-like behavior only for string literals but are shorter to write and are actually somewhat-optimized syntactic sugar for concatenation
  • Template strings are safer than str.format() (demonstrated in the first source) and the other two methods (implied in the first source) when dealing with user input

I understand that the aforementioned vulnerability in str.format() comes from the method being usable on any normal strings where the delimiting braces are part of the string data itself. Malicious user input containing brace-delimited replacement fields can be supplied to the method to access environment attributes. I believe this is unlike the other ways of formatting where the programmer is the only one that can supply variables to the pre-formatted string. For example, f-strings have similar syntax to str.format() but, because f-strings are literals and the inserted values are evaluated separately through concatenation-like behavior, they are not vulnerable to the same attack (source B). Both %-formatting and Template strings also seem to only be supplied variables for substitution by the programmer; the main difference pointed out is Template's more limited functionality.

My Confusion

I have seen a lot of emphasis on the vulnerability of str.format() which leaves me with questions of what I should be wary of when using the other techniques. Source A describes Template strings as the safest of the above methods "due to their reduced complexity":

The more complex formatting mini-languages of the other string formatting techniques might introduce security vulnerabilities to your programs.

  1. Yes, it seems like f-strings are not vulnerable in the same way str.format() is, but are there known concerns about f-string security as is implied by source A? Is the concern more like risk mitigation for unknown exploits and unintended interactions?

I am not familiar with C and I don't plan on using the clunkier %/printf-style formatting, but I have heard that C's printf had its own potential vulnerabilities. In addition, both sources A and B seem to imply a lack of security with this method. The top answer in Source B says,

String formatting may be dangerous when a format string depends on untrusted data. So, when using str.format() or %-formatting, it's important to use static format strings, or to sanitize untrusted parts before applying the formatter function.

  1. Do %-style strings have known security concerns?
  2. Lastly, which methods should be used and how can user input-based attacks be prevented (e.g. filtering input with regex)?
    • More specifically, are Template strings really the safer option? and Can f-strings be used just as easily and safely while granting more functionality?

Solution

  • It doesn't matter which format you choose, any format and library can have its own downsides and vulnerabilities. The bigger questions you need to ask yourself is what is the risk factor and the scenario you are facing with, and what are you going to do about it. First ask yourself: will there be a scenario where a user or an external entity of some kind (for example - an external system) sends you a format string? If the answer is no, there is no risk. If the answer is yes, you need to see whether this is needed or not. If not - remove it to eliminate the risk. If you need it - you can perform whitelist-based input validation and exclude all format-specific special characters from the list of permitted characters, in order to eliminate the risk. For example, no format string can pass the ^[a-zA-Z0-9\s]*$ generic regular expression.

    So the bottom line is: it doesn't matter which format string type you use, what's really important is what do you do with it and how can you reduce and eliminate the risk of it being tampered.