
Python split() method removes not only whitespaces


While working on a problem, I accidentally noticed that Python's str.split() method, called without any parameters, removes not only whitespace as described in the latest official documentation, but also any '\n' placed anywhere in the string.

For example, suppose I want to split the string ' a b c d \n ' using .split() without any parameters. According to the official documentation, I expected the list ['a', 'b', 'c', 'd', '\n']; however, the actual result is ['a', 'b', 'c', 'd'].

The same is true whether '\n' is at the end, at the beginning, or in the middle of the string.
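
A minimal reproduction of that observation is sketched below. Only the first sample string comes from the example above; the other two, and the names samples and results, are made up for illustration.

```python
# Sample strings with the newline in different positions.
# Only the "end" sample is from the question; the others are hypothetical.
samples = {
    "end": " a b c d \n ",
    "start": "\n a b c d ",
    "middle": " a b \n c d ",
}

# split() with no arguments splits on runs of whitespace and drops
# leading and trailing whitespace.
results = {position: text.split() for position, text in samples.items()}

for position, parts in results.items():
    print(position, parts)
```

Each of the three samples yields the same four-element list, with no '\n' element.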

I could not find this property of str.split() mentioned anywhere, so my question is: is this behavior reliable and guaranteed in all circumstances, or is it just luck?

I am running Python 3.10.8.


Solution

  • You are misreading the documentation you linked to. The term "whitespace" includes the newline character. From the documentation for Python's str.isspace():

    A character is whitespace if in the Unicode character database (see unicodedata), either its general category is Zs ("Separator, space"), or its bidirectional class is one of WS, B, or S.

    From the Unicode entry for the newline character (U+000A), you can see that its general category is Cc (control) but its bidirectional class is B (paragraph separator), so it is covered by the whitespace definition quoted above (WS is Unicode whitespace, a subset of Python whitespace, and S is segment separator).

    The following transcript shows that the newline is indeed in that whitespace class:

    >>> "\n".isspace()
    True
    

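    The ASCII whitespace characters can be obtained with the string.whitespace constant (for str objects, split() also treats non-ASCII Unicode whitespace, such as the no-break space '\xa0', as a separator):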

    >>> import string
    >>> string.whitespace
    ' \t\n\r\x0b\x0c'
    

    So it includes space, tab, newline, carriage return, vertical tab, and form feed.
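
The points above can be checked with a short script. This is only a sketch: the variable names are made up for illustration, and the explicit-separator call at the end is an addition showing one way to get output closer to what the question expected, not something the documentation quoted above prescribes.

```python
import string
import unicodedata

# Unicode properties that the isspace() definition refers to:
# (general category, bidirectional class).
newline_props = (unicodedata.category("\n"), unicodedata.bidirectional("\n"))
space_props = (unicodedata.category(" "), unicodedata.bidirectional(" "))

# Every character in string.whitespace acts as a separator
# when split() is called without arguments.
mixed_result = ("a" + string.whitespace + "b").split()

# With an explicit separator, only that exact string delimits fields,
# so the newline survives and empty strings appear at both ends.
explicit_result = " a b c d \n ".split(" ")

print(newline_props)    # ('Cc', 'B')
print(space_props)      # ('Zs', 'WS')
print(mixed_result)     # ['a', 'b']
print(explicit_result)  # ['', 'a', 'b', 'c', 'd', '\n', '']
```

If the newline has to be kept as its own element, passing the separator explicitly, as in the last call, is one option, at the cost of handling the empty strings yourself.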