python, string, comparison

How are strings compared?


I'm wondering how Python does string comparison, more specifically how it determines the outcome when the less-than (<) or greater-than (>) operator is used.

For instance, if I run print('abc' < 'bac') I get True. I understand that it compares corresponding characters in the strings, but it's unclear why more "weight", for lack of a better term, is placed on the fact that a is less than b in the first position than on the fact that, in the second position, the a of the second string is less than the b of the first.


Many people ask this question when the strings contain representations of numbers, and want to compare the numbers by numeric value. The straightforward solution is to convert the values first. See "How do I parse a string to a float or int?". If there are multiple numbers in a list or other collection, see "How can I collect the results of a repeated calculation in a list, dictionary etc. (or make a copy of a list with each element modified)?" for batch conversion.
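As a rough illustration of why converting first matters, here is a small sketch. The sample values are made up for the example; the point is only that string comparison and numeric comparison can disagree.

```python
# Hypothetical sample data: numbers stored as strings.
values = ['10', '9', '100', '2.5']

# Compared as strings, '10' comes before '9' because '1' < '9'.
print('10' < '9')                  # True
print(sorted(values))              # ['10', '100', '2.5', '9']

# Converting first compares by numeric value instead.
print(int('10') < int('9'))        # False
print(sorted(values, key=float))   # ['2.5', '9', '10', '100']
```

Using key=float keeps the original strings in the result and only uses the converted value for ordering.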

If you are trying to compare strings that contain digit sequences, treating the digits as if they were numeric (sometimes called "natural sort"), see "Is there a built in function for string natural sort?".
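For a feel of what "natural sort" means, one possible approach is a sort key that splits each string into digit and non-digit runs. The helper name natural_key and the file names are invented for this sketch; it is not a built-in and not necessarily what the linked question recommends.

```python
import re

def natural_key(text):
    """Hypothetical helper: split text into digit and non-digit runs,
    turning digit runs into ints so they compare numerically."""
    return [int(part) if part.isdecimal() else part
            for part in re.split(r'(\d+)', text)]

files = ['file10.txt', 'file2.txt', 'file1.txt']

# Plain string comparison puts 'file10.txt' before 'file2.txt'.
print(sorted(files))                   # ['file1.txt', 'file10.txt', 'file2.txt']

# With the key, the digit runs are compared as numbers.
print(sorted(files, key=natural_key))  # ['file1.txt', 'file2.txt', 'file10.txt']
```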


Solution

  • From the docs:

    The comparison uses lexicographical ordering: first the first two items are compared, and if they differ this determines the outcome of the comparison; if they are equal, the next two items are compared, and so on, until either sequence is exhausted.

    Also:

    Lexicographical ordering for strings uses the Unicode code point number to order individual characters.

    or on Python 2:

    Lexicographical ordering for strings uses the ASCII ordering for individual characters.

    As an example:

    >>> 'abc' > 'bac'
    False
    >>> ord('a'), ord('b')
    (97, 98)
    

    The result False is returned as soon as a is found to be less than b. The further items are not compared (as you can see for the second items: b > a is True).
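The rule quoted from the docs can be sketched in plain Python. This is only an illustration of the described behaviour, not how CPython actually implements it, and the function name less_than is made up for the example. It also covers the case where one string runs out first: the shorter string is then the smaller one.

```python
def less_than(a, b):
    """Illustrative sketch of lexicographic '<' for strings."""
    for x, y in zip(a, b):
        if x != y:
            # The first differing position decides the outcome;
            # later characters are never looked at.
            return ord(x) < ord(y)
    # No difference found in the common part: one string is a prefix
    # of the other (or they are equal), so the shorter one is smaller.
    return len(a) < len(b)

print(less_than('abc', 'bac'))  # True, decided at the first position
print(less_than('ab', 'abc'))   # True, 'ab' is a prefix of 'abc'
print(less_than('abc', 'abc'))  # False, the strings are equal
```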

    Be aware of lower and uppercase:

    >>> import string
    >>> abc = string.ascii_lowercase
    >>> [(x, ord(x)) for x in abc]
    [('a', 97), ('b', 98), ('c', 99), ('d', 100), ('e', 101), ('f', 102), ('g', 103), ('h', 104), ('i', 105), ('j', 106), ('k', 107), ('l', 108), ('m', 109), ('n', 110), ('o', 111), ('p', 112), ('q', 113), ('r', 114), ('s', 115), ('t', 116), ('u', 117), ('v', 118), ('w', 119), ('x', 120), ('y', 121), ('z', 122)]
    >>> [(x, ord(x)) for x in abc.upper()]
    [('A', 65), ('B', 66), ('C', 67), ('D', 68), ('E', 69), ('F', 70), ('G', 71), ('H', 72), ('I', 73), ('J', 74), ('K', 75), ('L', 76), ('M', 77), ('N', 78), ('O', 79), ('P', 80), ('Q', 81), ('R', 82), ('S', 83), ('T', 84), ('U', 85), ('V', 86), ('W', 87), ('X', 88), ('Y', 89), ('Z', 90)]
    

    Specifically, this means that 'a' > 'A', 'b' > 'B', and so on, including 'a' > 'Z', all evaluate to True, because every lowercase letter from a to z has a higher code point than every uppercase letter from A to Z.
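
If the uppercase-before-lowercase ordering is not what you want, one common option is to normalise case before comparing, for example with str.casefold(). A small sketch with made-up sample words:

```python
words = ['banana', 'Zebra', 'apple']

# Default ordering: 'Z' (90) sorts before every lowercase letter.
print(sorted(words))                    # ['Zebra', 'apple', 'banana']

# Comparing case-folded strings ignores the upper/lower distinction.
print(sorted(words, key=str.casefold))  # ['apple', 'banana', 'Zebra']
print('a'.casefold() > 'Z'.casefold())  # False
```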