Tags: python, regex, newline, linefeed

Replacing \n while keeping \r\n intact


I have a huge CSV file (196,244 lines) that contains \n in places other than the line endings. I want to remove those stray \n characters but keep \r\n intact. I tried line.replace, but it does not seem to recognize \r\n, so next I tried a regex:

with open(filetoread, "r") as inf:
    with open(filetowrite, "w") as fixed:
        for line in inf:
            line = re.sub("(?<!\r)\n", " ", line)
            fixed.write(line)

But this does not keep \r\n; it removes every line break. I can't do this in Notepad++ because it crashes on this file.


Solution

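  • You are not exposing the original line breaks to the regex engine. When a file is opened in text mode ("r") with the default newline handling, every \r\n is translated to \n on read, so the lookbehind (?<!\r) never finds a \r and every \n matches. To keep the line endings exactly as they are in the input, read the file in binary mode by adding b to the mode. You then also need the b prefix on the regex pattern and on the replacement, because a bytes input requires a bytes pattern.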
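The newline translation described above can be seen on a tiny file. This is only an illustrative sketch: the sample bytes and the temporary file name are made up, and the text-mode read assumes the default newline setting of Python 3.

```python
import os
import tempfile

# Hypothetical sample: one stray LF inside a field, CRLF as the real row terminator.
sample = b"id,comment\r\n1,first\nsecond\r\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    with open(path, "wb") as f:
        f.write(sample)

    # Text mode with the default newline handling: CRLF is translated to LF.
    with open(path, "r", encoding="utf-8") as f:
        as_text = f.read()

    # Binary mode: the bytes come back exactly as stored.
    with open(path, "rb") as f:
        as_bytes = f.read()

print(repr(as_text))   # 'id,comment\n1,first\nsecond\n'
print(repr(as_bytes))  # b'id,comment\r\n1,first\nsecond\r\n'
```

In the text-mode result there is no \r left for the lookbehind to detect, which is why every \n was replaced in the original attempt.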

    You can use

    import re

    # Binary mode keeps every \r\n pair exactly as stored on disk.
    with open(filetoread, "rb") as inf:
        with open(filetowrite, "wb") as fixed:
            fixed.write(re.sub(b"(?<!\r)\n", b" ", inf.read()))
    

    Now the whole file is read into a single bytes object (with inf.read()), so the regex sees every line break and replaces only the \n characters that are not preceded by \r.

    Pay attention to

    • "rb" when reading the file in
    • "wb" when writing the file out
    • re.sub(b"(?<!\r)\n", b" ", inf.read()) uses the b prefix on both literals, and inf.read() reads the entire file contents into a single bytes object.
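If holding the whole file in memory is a concern, one alternative (a different technique from the binary-mode answer above) is to stay in text mode and pass newline="" to open, which leaves line endings untranslated so the file can still be processed line by line. The following is a minimal sketch, assuming the file is UTF-8 encoded; the helper name strip_bare_lf and its encoding parameter are hypothetical and chosen only for illustration.

```python
import re

# An LF that is not preceded by CR, i.e. a stray line feed inside a record.
BARE_LF = re.compile(r"(?<!\r)\n")


def strip_bare_lf(src_path, dst_path, encoding="utf-8"):
    """Replace stray LF with a space while leaving CRLF untouched (hypothetical helper)."""
    # newline="" disables newline translation on both the read and the write side,
    # so each line keeps its original terminator.
    with open(src_path, "r", encoding=encoding, newline="") as inf, \
            open(dst_path, "w", encoding=encoding, newline="") as fixed:
        for line in inf:
            fixed.write(BARE_LF.sub(" ", line))
```

It would be called as strip_bare_lf(filetoread, filetowrite). Because each line arrives with its terminator intact, a line ending in \r\n passes through unchanged, while a line ending in a lone \n has that \n replaced with a space and is thereby joined to the following line.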