python split ascii extended-ascii

How to split a line at a non-printing ASCII character in Python


How can I split a line in Python at a non-printing ASCII character (such as the long minus sign, hex 0x97, octal 227)? I don't need the character itself; the text after it will be saved in a variable.


Solution

  • You can use re.split.

    >>> import re
    >>> re.split(r'\W+', 'Words, words, words.')
    ['Words', 'words', 'words', '']
    

    Adjust the pattern so that it matches only the characters you want to split on.

    See also: Stripping non-printable characters from a string in Python
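
One way to adjust the pattern, as a minimal sketch: a character class covering the control-character ranges, which includes 0x97 when it shows up as a code point in a text string. This assumes Python 3 and a `str` input; the name `CONTROL_CHARS` and the sample line are made up for illustration.

```python
import re

# Illustrative name (not from the original answer): matches the C0 controls,
# DEL, and the C1 controls (code points 0x80-0x9f), which includes 0x97.
CONTROL_CHARS = re.compile(r'[\x00-\x1f\x7f-\x9f]')

line = 'key\x97value'  # made-up sample line
parts = CONTROL_CHARS.split(line)
print(parts)  # ['key', 'value']
```

Note that this class also matches tab and newline, so narrow the ranges if those should stay in the text.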


    Example with the long dash (Python 2 byte string, UTF-8 encoded):

    >>> # \xe2\x80\x93 is the UTF-8 encoding of the en dash (U+2013)
    >>> s = 'hello – world'
    >>> s
    'hello \xe2\x80\x93 world'
    >>> import re
    >>> re.split("\xe2\x80\x93", s)
    ['hello ', ' world']
    

    Or, the same with a Unicode string:

    >>> # \u2013 is the en dash
    >>> s = u'hello – world'
    >>> s
    u'hello \u2013 world'
    >>> import re
    >>> re.split(u"\u2013", s)
    [u'hello ', u' world']
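
For the byte named in the question (0x97), a Python 3 sketch might look like the following. It assumes the line arrives as raw bytes and that the data is Windows-1252 encoded, where 0x97 decodes to the em dash (U+2014); the sample line and variable names are illustrative only.

```python
# Made-up sample: a line read in binary mode that contains the 0x97 byte.
raw = b'hello \x97 world'

# Option 1: split the bytes directly at the 0x97 byte.
before, sep, after = raw.partition(b'\x97')

# Option 2: decode first (assumption: Windows-1252), then split at the em dash
# and keep only the text after it.
text = raw.decode('cp1252')
value = text.split('\u2014', 1)[1].strip()

print(after)  # b' world'
print(value)  # world
```

`partition` splits at the first occurrence only and returns an empty `sep` when the byte is absent, which makes the "separator missing" case easy to detect.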