
Which characters are considered whitespace by split()?


I am porting some Python 2 code that calls split() on strings, so I need to know its exact behavior. The documentation states that when you do not specify the sep argument, "runs of consecutive whitespace are regarded as a single separator".

Unfortunately, it does not specify which characters that would be. There are some obvious contenders (like space, tab, and newline), but Unicode contains plenty of other candidates.

Which characters are considered to be whitespace by split()?

Since the answer might be implementation-specific, I'm targeting CPython.

(Note: I researched the answer to this myself since I couldn't find it anywhere, so I'll be posting it here, hopefully for the benefit of others.)


Solution

  • Unfortunately, it depends on whether your string is a str or a unicode (at least in CPython; I don't know whether this behavior is mandated by a specification anywhere).

    If it is a str, the answer is straightforward. Exactly six characters count as whitespace:

    • 0x09 Tab
    • 0x0a Newline
    • 0x0b Vertical Tab
    • 0x0c Form Feed
    • 0x0d Carriage Return
    • 0x20 Space

    Source: these are the characters with PY_CTF_SPACE in Python/pyctype.c, which are used by Py_ISSPACE, which is used by STRINGLIB_ISSPACE, which is used by split_whitespace.
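
    The list above can be checked empirically. The following is a minimal sketch, not taken from the CPython sources: the helper name byte_split_whitespace is made up for illustration, and the b"" literals are assumed so that the same code exercises str on Python 2 and bytes on Python 3 (which, as far as I can tell, splits on the same ASCII-only set).

```python
def byte_split_whitespace():
    """Return the byte values that split() treats as a separator.

    Hypothetical helper: a byte counts as whitespace if placing it
    between two letters makes split() return two fields.
    """
    found = []
    for code in range(256):
        # bytes(bytearray([...])) builds a one-byte string on Python 2 and 3.
        sep = bytes(bytearray([code]))
        if (b"a" + sep + b"b").split() == [b"a", b"b"]:
            found.append(code)
    return found


print(["0x%02x" % code for code in byte_split_whitespace()])
```

    Probing split() directly, rather than calling isspace(), tests the behavior the question is actually about.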

    If it is a unicode, there are 29 whitespace characters: the six above plus the following:

    • U+001c through U+001f: File/Group/Record/Unit Separator
    • U+0085: Next Line
    • U+00a0: Non-Breaking Space
    • U+1680: Ogham Space Mark
    • U+2000 through U+200a: various fixed-size spaces (e.g. Em Space), but note that Zero-Width Space (U+200b) is not included
    • U+2028: Line Separator
    • U+2029: Paragraph Separator
    • U+202f: Narrow No-Break Space
    • U+205f: Medium Mathematical Space
    • U+3000: Ideographic Space

    Note that the first four (U+001c through U+001f) are also valid ASCII characters, which means that an ASCII-only string might split differently depending on whether it is a str or a unicode!
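
    To make the ASCII pitfall concrete, here is a small sketch using U+001c (File Separator) as the only candidate separator. The variable names and sample strings are invented for illustration; the explicit b"" and u"" prefixes are there so the snippet means the same thing on Python 2 (str vs. unicode) and on Python 3 (bytes vs. str).

```python
data = b"left\x1cright"   # str on Python 2, bytes on Python 3
text = u"left\x1cright"   # unicode on Python 2, str on Python 3

# The byte string is not split: 0x1c is not in the six-character ASCII set.
print(data.split())

# The Unicode string is split: U+001c is in the larger Unicode set.
print(text.split())
```

    When porting, this matters most for code that calls split() on what used to be a Python 2 str and becomes a Python 3 str, since that moves the call from the small set to the large one.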

    Source: these are the characters listed in _PyUnicode_IsWhitespace, which is used by Py_UNICODE_ISSPACE, which is used by STRINGLIB_ISSPACE (it looks like the same function implementations are used for both str and unicode, but compiled separately for each type, with certain macros defined differently). The docstring describes this set of characters as follows:

    Unicode characters having the bidirectional type 'WS', 'B' or 'S' or the category 'Zs'
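
    The quoted rule can be reproduced with the standard unicodedata module. This is a sketch under assumptions: is_split_whitespace is a hypothetical helper that restates the rule in Python, not the function CPython calls, and the unichr fallback is only there to let the snippet run on both Python 2 and Python 3. The resulting count follows the Unicode database bundled with the interpreter, so it may differ between Python versions.

```python
import sys
import unicodedata

try:
    unichr            # exists on Python 2
except NameError:
    unichr = chr      # Python 3: chr() already returns a Unicode character


def is_split_whitespace(ch):
    """Restate the quoted rule for one Unicode character (illustrative only)."""
    return (unicodedata.bidirectional(ch) in ("WS", "B", "S")
            or unicodedata.category(ch) == "Zs")


# Scan every code point the interpreter supports.
whitespace = [code for code in range(sys.maxunicode + 1)
              if is_split_whitespace(unichr(code))]

print(len(whitespace))
print(", ".join("U+%04x" % code for code in whitespace))
```

    Running this on the interpreter you are porting from and on the one you are porting to is a quick way to see whether the two sets differ.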