javascript theory sanitization

sanitize upper vs lower case


Is there a reason that when sanitizing a string, the characters are converted to lowercase as opposed to uppercase?

I've seen this convention in many languages, but for my current environment, let's say Rails and/or JavaScript.


Solution

  • No specific reason to my knowledge, but neither uppercasing nor lowercasing is the whole story in the Unicode world.

    For example, the German letter ß is treated as equivalent to ss: both are lowercase, and a word spelled with ß can also be spelled with ss. Lowercasing alone therefore won't make the two spellings compare equal.

    Conversely, in Turkish, ı (dotless i) is a distinct letter from i (dotted i), but unless your locale is Turkish, uppercasing either one produces I (the plain ASCII capital I, which has no dot). That erases the distinction and can change the meaning of a word, so you don't want to substitute one for the other; they aren't equivalent.

    Because of this, some programming languages offer a more specific "case normalizing" conversion that follows the case folding rules in section 3.13 of the Unicode standard; Python 3.3 introduced str.casefold for that reason. It's much like .lower(), but it also normalizes things like ß to ss because they're logically equivalent (if you're uniquifying, you wouldn't want two strings that differ only in ß vs. ss to be treated as different).

    If you don't have case folding available in your language, then the distinction between normalizing as upper vs. lower case is mostly by convention.
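
The points above can be sketched in Python, since that is the language the answer cites for case folding. This is a minimal illustration, not a full sanitizer: the sample words and the helper name dedupe_caseless are assumptions made up for the example.

```python
# Illustrative sketch only: the sample strings and the helper name
# `dedupe_caseless` are hypothetical, not part of the original answer.

def dedupe_caseless(words):
    """Keep the first spelling seen for each case-folded key."""
    seen = {}
    for word in words:
        key = word.casefold()  # full Unicode case folding (Python 3.3+)
        seen.setdefault(key, word)
    return list(seen.values())

# lower() leaves ß alone, so the two spellings stay distinct.
print("Straße".lower() == "strasse".lower())        # False
# casefold() maps ß to ss, so they compare equal.
print("Straße".casefold() == "STRASSE".casefold())  # True
# Python's upper() ignores locale: dotted i and dotless ı both become I.
print("i".upper() == "ı".upper())                   # True

print(dedupe_caseless(["Straße", "STRASSE", "strasse", "Gasse"]))
# ['Straße', 'Gasse']
```

The helper keys on the folded form but returns the original spelling, so folding is used only for comparison and the user's text is not rewritten.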