Why does os.linesep only work on certain strings in Python?


I want to read in the content of a webpage as a string and remove all the line breaks. To make my script platform independent, I thought it'd be a good idea to look for os.linesep instead of '\n' or "\r\n". To replace the unwanted characters with other characters, I use str.replace. It did not work with a webpage, so I used a txt file for testing. The content of the file is straightforward:

This is line one
this is line two
why does linsep not work?
I don't get it!

Strangely, when I read in the file as a binary stream and then decode it, it does find all the line breaks. When I read it in as text, it does not. I checked with type() whether both results, the one read as text and the one decoded from the binary stream, are really strings, and both appear to be. This really bugs me. Can someone please explain what I'm misunderstanding here?

Here's my test code:

import os

file = open(r"C:\Users\path\LinebreakTest.txt", "r")
data = file.read().replace(os.linesep, "REPLACEMENT")
print(type(data))
print(data)

file = open(r"C:\Users\path\LinebreakTest.txt", "rb")
dataBin = file.read().decode("utf-8").replace("\n", "REPLACEMENT")
print(type(dataBin))
print(dataBin)

This is my output:

<class 'str'>
This is line one
this is line two
why does linsep not work?
I don't get it!

<class 'str'>
This is line one
REPLACEMENTthis is line two
REPLACEMENTwhy does linsep not work?
REPLACEMENTI don't get it!
REPLACEMENT

Thanks in advance!


Solution

  • The problem is that with os.linesep, you're assuming that the file you're processing was created on the same platform the script is running on, which may not be the case, especially for websites, which are created in a variety of development environments.

    The HTTP server does not convert newlines to the platform of the client, but instead streams the data as-is, hoping that the client itself is platform-independent (which is the case for most modern browsers).

    Fortunately, there aren't that many line separators in use. According to its source code, Python's own os.linesep can only be one of two values: \n or \r\n.

    Therefore, I'd suggest simplifying things. First replace any instance of '\r\n' with '\n', and then replace every remaining '\n':

    data = file.read().replace('\r\n', '\n').replace('\n', "REPLACEMENT")
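The difference between the two reads in the question is consistent with how Python opens files in text mode: with the default newline=None, '\r\n' and a lone '\r' are translated to '\n' while reading, so on Windows the string no longer contains os.linesep ('\r\n') by the time replace() runs. The following is a minimal sketch of that behaviour together with a small helper for the replacement suggested above. The names replace_linebreaks and read_both_ways are made up for illustration, and handling a lone '\r' is an extra assumption that goes slightly beyond the one-liner above.

```python
import os
import tempfile


def replace_linebreaks(text, replacement=""):
    # Hypothetical helper: collapse CRLF first so it is not counted twice,
    # then treat a lone CR as a line break as well (an added assumption).
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    return normalized.replace("\n", replacement)


def read_both_ways(raw):
    # Write the raw bytes to a temporary file, then read them back twice:
    # once with default text-mode newline translation, once without it.
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(raw)
        # newline=None (the default): '\r\n' and '\r' become '\n' on read.
        with open(path, "r", encoding="utf-8") as f:
            translated = f.read()
        # newline="": line endings are returned exactly as stored.
        with open(path, "r", encoding="utf-8", newline="") as f:
            untranslated = f.read()
    finally:
        os.remove(path)
    return translated, untranslated


if __name__ == "__main__":
    translated, untranslated = read_both_ways(b"one\r\ntwo\r\n")
    print(repr(translated))    # 'one\ntwo\n'
    print(repr(untranslated))  # 'one\r\ntwo\r\n'
    print(replace_linebreaks(untranslated, "REPLACEMENT"))
```

For a file read in text mode with the defaults, replacing '\n' alone is usually enough; the helper is mainly useful for strings that arrive untranslated, such as bytes decoded from an HTTP response.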