Tags: python, python-2.7, readlines

Python restrict newline characters for readlines()


I am trying to split a text that uses a mix of newline characters: LF, CRLF and NEL. I need the best way to keep the NEL character from being treated as a line break.

Is there an option to instruct readlines() to exclude NEL while splitting lines? Alternatively, I could read() the whole file and split only at LF and CRLF in a loop.

Is there any better solution?

I open the UTF-8 text file with codecs.open().

And while using readlines(), it does split at NEL characters:

[session screenshot]

The file contents are:

u'Line 1 \x85 Line 1.1\r\nLine 2\r\nLine 3\r\n'

Solution

  • file.readlines() will only ever split on \n, \r or \r\n, depending on the OS and on whether universal newline support is enabled.

    U+0085 NEXT LINE (NEL) is not recognised as a newline splitter in that context, and you don't need to do anything special to have file.readlines() ignore it.

    Quoting the open() function documentation:

    Python is usually built with universal newlines support; supplying 'U' opens the file as a text file, but lines may be terminated by any of the following: the Unix end-of-line convention '\n', the Macintosh convention '\r', or the Windows convention '\r\n'. All of these external representations are seen as '\n' by the Python program. If Python is built without universal newlines support a mode with 'U' is the same as normal text mode. Note that file objects so opened also have an attribute called newlines which has a value of None (if no newlines have yet been seen), '\n', '\r', '\r\n', or a tuple containing all the newline types seen.

    and the universal newlines glossary entry:

    A manner of interpreting text streams in which all of the following are recognized as ending a line: the Unix end-of-line convention '\n', the Windows convention '\r\n', and the old Macintosh convention '\r'. See PEP 278 and PEP 3116, as well as str.splitlines() for an additional use.

    Unfortunately, codecs.open() breaks with this rule; the documentation only vaguely alludes to line splitting being left to the specific codec:

    Line-endings are implemented using the codec’s decoder method and are included in the list entries if keepends is true.

    Instead of codecs.open(), use io.open() to open the file in the correct encoding, then read the lines:

    import io

    with io.open(filename, encoding=correct_encoding) as f:
        lines = f.readlines()
    

    io is the new I/O infrastructure that replaces the Python 2 system entirely in Python 3. It handles just \n, \r and \r\n:

    >>> open('/tmp/test.txt', 'wb').write(u'Line 1 \x85 Line 1.1\r\nLine 2\r\nLine 3\r\n'.encode('utf8'))
    >>> import codecs
    >>> codecs.open('/tmp/test.txt', encoding='utf8').readlines()
    [u'Line 1 \x85', u' Line 1.1\r\n', u'Line 2\r\n', u'Line 3\r\n']
    >>> import io
    >>> io.open('/tmp/test.txt', encoding='utf8').readlines()
    [u'Line 1 \x85 Line 1.1\n', u'Line 2\n', u'Line 3\n']
    

    The codecs.open() result is due to the codec's line handling using str.splitlines(), which has a documentation bug: when splitting a unicode string, it splits on anything that the Unicode standard deems to be a line break (which is quite a complex issue). The documentation for this method falls short of explaining this; it claims to split only according to the universal newlines rules.
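
As a rough illustration of the difference described above, the following sketch compares three ways of splitting the sample text from the question. It is written to run on both Python 2.7 and Python 3 and works on an in-memory string rather than a file. It assumes that io.StringIO follows the same newline rules as io.open(). The helper name split_lf_crlf is made up for this example, and it also covers the "read() and split on LF and CRLF only" idea from the question.

```python
import io
import re

# Sample text from the question: one NEL (U+0085) and three CRLF endings.
text = u'Line 1 \x85 Line 1.1\r\nLine 2\r\nLine 3\r\n'

# splitlines() on Unicode text treats U+0085 (NEL) as a line boundary.
by_splitlines = text.splitlines(True)

# The io layer only recognises \n, \r and \r\n as line endings.
# newline='' keeps the original endings instead of translating them to \n.
by_io = io.StringIO(text, newline='').readlines()


def split_lf_crlf(s):
    """Hypothetical helper: split after LF or CRLF only, keeping the endings.

    Unlike the io layer, a lone \\r is NOT treated as a line boundary here.
    """
    return re.findall(u'[^\n]*\n|[^\n]+', s)


by_regex = split_lf_crlf(text)

print(len(by_splitlines), len(by_io), len(by_regex))
```

With this input, splitlines() produces four pieces because it breaks at the NEL, while the io-based and regex-based variants each produce three lines with the NEL left in place inside the first one. Passing newline='' to io.open() should give the same untranslated \r\n endings when reading from a real file.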