Is there a reason that, when sanitizing a string, the characters are converted to lowercase as opposed to uppercase?
I've seen this convention in many languages, but in terms of my current environment, we'll say Rails and/or JavaScript.
No specific reason to my knowledge, but neither uppercasing nor lowercasing is the whole story in the Unicode world.
For example, the German letter ß is equivalent to ss for comparison purposes; they're both lowercase, and a word spelled with ß can also be spelled with ss.
Conversely, in Turkish, ı (dotless i) is distinct from i (dotted i), but unless your locale is Turkish, uppercasing either one produces I (the ASCII capital I, which has no dot). This changes the meaning: the two letters aren't equivalent, and you don't want to end up with the wrong one.
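As a small illustration of the Turkish case, here is a sketch in Python, whose built-in str.upper() and str.lower() are locale-independent. The variable names are only for illustration.

```python
# Python's str.upper() uses the default (non-Turkish) Unicode case mappings.
dotless = "\u0131"  # ı, LATIN SMALL LETTER DOTLESS I
dotted = "i"        # ordinary ASCII i

# Both letters collapse to the same ASCII capital I.
upper_dotless = dotless.upper()
upper_dotted = dotted.upper()
print(upper_dotless == upper_dotted == "I")

# Lowercasing again always yields the dotted i, so the distinction is lost.
round_trip = upper_dotless.lower()
print(round_trip == dotted)
```

Once both letters have been uppercased, there is no way to tell which one the original text contained.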
Because of this, some programming languages offer more specific "case normalizing" conversions based on the case folding rules in section 3.13 of the Unicode standard; Python 3.3 introduced str.casefold for that reason. It's much like .lower(), but it will also normalize things like ß to ss because they're logically equivalent (if you're uniquifying, you wouldn't want two strings that differ only in ß vs. ss to be treated as different).
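A minimal sketch of that uniquifying scenario in Python 3.3+ follows. The helper name dedupe_caseless and the sample words are made up for illustration; only str.casefold and str.lower are real library calls.

```python
def dedupe_caseless(words):
    """Keep the first occurrence of each word, comparing by case-folded form."""
    seen = set()
    unique = []
    for word in words:
        key = word.casefold()  # folds ß to ss, unlike lower()
        if key not in seen:
            seen.add(key)
            unique.append(word)
    return unique


words = ["Straße", "STRASSE", "strasse"]

# lower() leaves ß alone, so two distinct keys survive.
lowered_keys = {w.lower() for w in words}

# casefold() maps all three spellings to the same key.
folded_keys = {w.casefold() for w in words}

print(sorted(lowered_keys))
print(sorted(folded_keys))
print(dedupe_caseless(words))
```

With lower() as the key, "Straße" and "STRASSE" would count as different entries; with casefold() they collapse into one.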
If you don't have case folding available in your language, then the distinction between normalizing as upper vs. lower case is mostly by convention.