The string (expressed as UTF-8 bytes),
b'\xe8\xb0\x81\xe6\x98\xaf\xe8\xb0\x81\xe7\x9a\x84\xe5\x91\xa8\xe6\x9d\xb0\xe4\xbc\xa6'
does not begin with
b'\xe6\x98\xaf\xe8\xb0\x81'
However, calling strip() on it, as shown below, does remove this prefix. Does anyone know why this is happening?
Python 3.6.8 (default, Apr 19 2021, 17:20:37)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-44)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> "谁是谁的周杰伦".strip("是谁")
'的周杰伦'
>>> bytes('是谁', encoding='UTF-8')
b'\xe6\x98\xaf\xe8\xb0\x81'
>>> bytes('谁是谁的周杰伦', encoding='UTF-8')
b'\xe8\xb0\x81\xe6\x98\xaf\xe8\xb0\x81\xe7\x9a\x84\xe5\x91\xa8\xe6\x9d\xb0\xe4\xbc\xa6'
The complex Unicode codepoints in your question make this a bit more confusing than needed. Consider this simpler example:
>>> "abcde".strip("ba")
'cde'
str.strip is working as intended. The argument to strip is treated as a set of characters, not as a complete string to match. Prefixes and suffixes consisting entirely of any of the characters passed in, in any order, get removed.
Quoting the docs:
The outermost leading and trailing chars argument values are stripped from the string. Characters are removed from the leading end until reaching a string character that is not contained in the set of characters in chars. A similar action takes place on the trailing end.
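The behaviour described in the docs can be sketched in plain Python. This is only an illustrative re-implementation, not how CPython does it internally; the name strip_chars is hypothetical and is not part of the standard library.

```python
def strip_chars(text, chars):
    """Illustrative equivalent of text.strip(chars) (hypothetical helper)."""
    char_set = set(chars)  # order and repetition in chars do not matter
    start, end = 0, len(text)
    # Advance past leading characters that belong to the set.
    while start < end and text[start] in char_set:
        start += 1
    # Retreat past trailing characters that belong to the set.
    while end > start and text[end - 1] in char_set:
        end -= 1
    return text[start:end]


print(strip_chars("abcde", "ba"))
print(strip_chars("谁是谁的周杰伦", "是谁"))
```

Because only set membership is tested, all three leading characters 谁, 是, 谁 are dropped before the loop stops at 的, which is why the result starts there.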
If you want to remove an exact prefix, use str.removeprefix (added in Python 3.9, so it is not available on the 3.6.8 interpreter shown in the question):
>>> "谁是谁的周杰伦".removeprefix("是谁")
'谁是谁的周杰伦' # no match, bad order
>>> "谁是谁的周杰伦".removeprefix("谁是")
'谁的周杰伦' # match
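On interpreters older than 3.9, such as the 3.6.8 one in the question, the same effect can be had with startswith and slicing. A minimal sketch, assuming a helper named remove_prefix (hypothetical, not a standard library function):

```python
def remove_prefix(text, prefix):
    """Return text without prefix if it starts with it, else text unchanged."""
    if prefix and text.startswith(prefix):
        return text[len(prefix):]
    return text


print(remove_prefix("谁是谁的周杰伦", "是谁"))  # prefix does not match, unchanged
print(remove_prefix("谁是谁的周杰伦", "谁是"))  # exact prefix removed once
```

Unlike strip, this matches the prefix as a whole string and removes it at most once.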