
Using strip() on string removes prefix that does not match the argument


The string (expressed as UTF-8 bytes),

b'\xe8\xb0\x81\xe6\x98\xaf\xe8\xb0\x81\xe7\x9a\x84\xe5\x91\xa8\xe6\x9d\xb0\xe4\xbc\xa6' 

does not begin with

b'\xe6\x98\xaf\xe8\xb0\x81'

However, using strip() on it below does remove this prefix. Does anyone know why this is happening?

Python 3.6.8 (default, Apr 19 2021, 17:20:37) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-44)] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> "谁是谁的周杰伦".strip("是谁")
'的周杰伦'
>>> bytes('是谁', encoding='UTF-8')
b'\xe6\x98\xaf\xe8\xb0\x81'
>>> bytes('谁是谁的周杰伦', encoding='UTF-8')
b'\xe8\xb0\x81\xe6\x98\xaf\xe8\xb0\x81\xe7\x9a\x84\xe5\x91\xa8\xe6\x9d\xb0\xe4\xbc\xa6'

Solution

  • The complex Unicode codepoints in your question make this a bit more confusing than needed. Consider this simpler example:

    >>> "abcde".strip("ba")
    'cde'
    

    str.strip is working as intended. Its argument is treated as a set of characters, not as a substring to match. Any leading or trailing run consisting entirely of those characters, in any order and any number of times, gets removed.

    Quoting the docs:

    The outermost leading and trailing chars argument values are stripped from the string. Characters are removed from the leading end until reaching a string character that is not contained in the set of characters in chars. A similar action takes place on the trailing end.
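
To make the quoted behavior concrete, here is a minimal sketch of what strip does with its argument. The function name strip_chars is hypothetical and exists only for illustration; CPython's real implementation is written in C, but the observable result should be the same.

```python
def strip_chars(text, chars):
    """Illustrative stand-in for text.strip(chars); not the real implementation."""
    char_set = set(chars)  # order and repetition inside `chars` do not matter
    start, end = 0, len(text)
    # Advance past every leading character that belongs to the set.
    while start < end and text[start] in char_set:
        start += 1
    # Retreat past every trailing character that belongs to the set.
    while end > start and text[end - 1] in char_set:
        end -= 1
    return text[start:end]


print(strip_chars("谁是谁的周杰伦", "是谁"))  # 的周杰伦
print(strip_chars("abcde", "ba"))            # cde
```

In the question's string, the first three characters (谁, 是, 谁) are all members of the set {是, 谁}, so all three are removed, even though the string never starts with the substring "是谁".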

    If you want to remove an exact prefix, use str.removeprefix (added in Python 3.9, so it is not available on the 3.6.8 interpreter shown in the question):

    >>> "谁是谁的周杰伦".removeprefix("是谁")   # no match: wrong order
    '谁是谁的周杰伦'
    
    >>> "谁是谁的周杰伦".removeprefix("谁是")   # match
    '谁的周杰伦'
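
Since str.removeprefix only exists from Python 3.9 onward, an older interpreter such as the 3.6.8 one in the question needs a small substitute. The helper below is a sketch, and the name remove_prefix is hypothetical; it relies only on str.startswith and slicing.

```python
def remove_prefix(text, prefix):
    """Illustrative fallback for str.removeprefix on Python versions before 3.9."""
    # Only strip when the string really starts with the exact prefix.
    if prefix and text.startswith(prefix):
        return text[len(prefix):]
    return text


print(remove_prefix("谁是谁的周杰伦", "是谁"))  # 谁是谁的周杰伦 (unchanged, no match)
print(remove_prefix("谁是谁的周杰伦", "谁是"))  # 谁的周杰伦
```

Unlike strip, this removes the prefix at most once and only when the characters appear in exactly the given order.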