Tags: python, split, end-of-line

splitlines() and iterating over an opened file give different results


I have files with sometimes weird end-of-lines characters like \r\r\n. With this, it works like I want:

with open('test.txt', 'wb') as f:  # simulate a file with weird end-of-lines
    f.write(b'abc\r\r\ndef')
with open('test.txt', 'rb') as f:
    for l in f:
        print(l)
# b'abc\r\r\n'         
# b'def'

I want to be able to get the same result from a string. I thought about splitlines(), but it does not give the same result:

print(b'abc\r\r\ndef'.splitlines())
# [b'abc', b'', b'def']

Even with keepends=True, it's not the same result.
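
For reference, here is a quick check of what keepends=True returns on the same input. The variable names are only for illustration. bytes.splitlines() treats \r, \n and \r\n each as a line boundary, whereas iterating over a file opened in binary mode only breaks on \n:

```python
data = b'abc\r\r\ndef'

# Each of \r, \n and \r\n counts as a separate line boundary here
without_ends = data.splitlines()
with_ends = data.splitlines(keepends=True)

print(without_ends)  # [b'abc', b'', b'def']
print(with_ends)     # [b'abc\r', b'\r\n', b'def']
```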

Question: how can I get the same behaviour as for l in f when splitting a string, with splitlines() or something similar?

Linked: Changing str.splitlines to match file readlines and https://bugs.python.org/issue22232

Note: I don't want to wrap everything in a BytesIO or StringIO, because that halves the speed (already benchmarked); I want to keep a plain string. So it's not a duplicate of How do I wrap a string in a file in Python?.


Solution

  • Why not just split on b'\n':

    input = b'\nabc\r\r\r\nd\ref\nghi\r\njkl'
    result = input.split(b'\n') 
    print(result)
    
    [b'', b'abc\r\r\r', b'd\ref', b'ghi\r', b'jkl']
    

    This drops the trailing \n from every line; if you really need it, you can add it back afterwards. The last piece needs special handling, because it should only get a \n if the input itself ended with one. For example:

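    fixed = [bstr + b'\n' for bstr in result]
    if input.endswith(b'\n'):   # input[-1] is an int in Python 3, so compare with endswith()
        fixed.pop()             # split() left an empty trailing piece; drop it
    else:
        fixed[-1] = fixed[-1][:-1]  # the last line had no newline; remove the one just added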
    print(fixed)
    
    [b'\n', b'abc\r\r\r\n', b'd\ref\n', b'ghi\r\n', b'jkl']
    

    Another variant uses a generator. It avoids building the whole list in memory, which helps with large inputs, and the loop looks like the original one: for l in bin_split(input):

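    def bin_split(input_str):
        # Yield chunks that end at b'\n' only, like iterating over a binary file.
        start = 0
        while start < len(input_str):
            # find() returns -1 when no newline is left, so found becomes 0
            found = input_str.find(b'\n', start) + 1
            if found == 0:
                yield input_str[start:]  # final line without a newline
                break
            yield input_str[start:found]  # line including its b'\n'
            start = found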
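
A regular expression is another way to get the same splitting; this is a different technique from the generator above, shown only as a minimal sketch. The helper name split_like_file is made up for this example. io.BytesIO is used here purely as a reference to check the output, not as part of the solution:

```python
import io
import re

# Match either "any bytes up to and including \n"
# or a final run of bytes that has no terminating \n.
_LINE_RE = re.compile(rb'[^\n]*\n|[^\n]+')

def split_like_file(data):
    """Hypothetical helper: split bytes on b'\\n' only, keeping the terminator."""
    return _LINE_RE.findall(data)

samples = [
    b'abc\r\r\ndef',
    b'\nabc\r\r\r\nd\ref\nghi\r\njkl',
    b'abc\n',
    b'',
]
for sample in samples:
    expected = list(io.BytesIO(sample))  # reference behaviour of "for l in f"
    assert split_like_file(sample) == expected

print(split_like_file(b'abc\r\r\ndef'))  # [b'abc\r\r\n', b'def']
```

Whether this is faster than the generator or than BytesIO depends on the input, so it is worth benchmarking on your own data.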