Search code examples
pythonstringnewline

Python Script read LF without CR in text file and replace with another character


I have several tab-delimited text files obtained from the FFIEC site (https://cdr.ffiec.gov/public/PWS/DownloadBulkData.aspx) <Call Reports - Single Period, Schedule RIE> that have LF (line feed) characters without CR (carriage return) characters in one or more of the fields. The files are being uploaded to SQL Server 2022 (or used in Excel). Each record (line) of the file ends with a CRLF sequence. The problem is that the LF in a field is interpreted as starting the next record when the text file is read (in Excel or using SSIS to import to SQL Server).

I am aware of the \r\n in Windows vs \n UNIX/Linux and suspect Python is handling either as a sequence. I have not tried Latin-1 or cp1252 encoding.

I am running Windows 11 Pro. The script is called from a shell command (SQL stored procedure or Excel VBA) and is part of a larger group of scripts to clean up the files for importing.

My attempted solution was to read the file, iterate through it one character at a time, find the LF '\n' that is not preceded by a CR '\r', and replace it with a semicolon ';'.

Python code (v3.12):

import sys

def stripLFwoCR_file(file_path):
    # Read the entire file contents
    with open(file_path, 'r', encoding='utf-8') as file:
        input_data = file.read()

    # Initialize output
    output_data = []

    # Iterate input content 1 character at a time
    # Replace line feed characters '\n' not preceded by carriage return characters '\r' with ';'
    i = 0
    while i < len(input_data):
        if input_data[i] == '\n':
            # If previous character is not '\r' then replace '\n' with ';'
            if i == 0 or input_data[i-1] != '\r':
                output_data.append(';')
            # Skip this '\n'
        else:
            output_data.append(input_data[i])
        i += 1

    # Write the modified content back to the file, overwriting it
    with open(file_path, 'w', encoding='utf-8') as file:
        file.write(''.join(output_data))

if __name__ == "__main__":
    args = sys.argv
    # args[0] = current file
    # args[1] = function name
    # args[2:] = function args : (*unpacked)
    globals()[args[1]](*args[2:])

The problem encountered is that the script replaces all LF AND all CRLF in the file with ';'.

Sample showing original document (LF without CR) Lines 10-14 are part of the same record. Lines 16-21 are one record.

Update: I need to read the manual! Since 3.x, Python has an option to ignore or use a different the auto-replace for newline. My original code also has a logic error in the while loop.

I ended up using this because it required less re-writes to the rest of my code. I did test the answer from @JRiggles and marked it as the solution (cleaner, less code):

import sys

def stripLFwoCR_file(file_path):
    # Read the entire file contents
    with open(file_path, 'r', encoding='utf-8', newline='\r\n') as file:
        input_data = file.read()

    # Initialize output
    output_data = []

    # Iterate input content 1 character at a time
    # Replace line feed characters '\n' not preceded by carriage return characters '\r' with ';'
    i = 0
    while i < len(input_data):
        if input_data[i] == '\n':
            # If previous character is not '\r' then replace '\n' with ';'
            if i == 0 or input_data[i-1] != '\r':
                # Skip this '\n' and replace
                output_data.append(';')
            else:
                output_data.append(input_data[i])
        else:
            output_data.append(input_data[i])
        i += 1

    # Write the modified content back to the file, overwriting it
    with open(file_path, 'w', encoding='utf-8', newline='\n') as file:
        file.write(''.join(output_data))

if __name__ == "__main__":
    args = sys.argv
    # args[0] = current file
    # args[1] = function name
    # args[2:] = function args : (*unpacked)
    globals()[args[1]](*args[2:])

Solution

  • This sounds like a job for re.sub. The pattern (?<!\r)\n will match any LF characters \n which aren't preceded by a carriage return (CR) \r.

    Here's a sample file, sample data.txt (screenshot showing line endings)

    enter image description here

    To avoid any line ending conversions, open the file in binary read mode 'rb'

    import re
    
    
    pattern = b'(?<!\r)\n'  # match any \n not preceded by \r
    
    with open(r'<path to>\sample data.txt', 'rb') as file:
        data = file.read()
        print('Pre-substitution: ', data)
        # replace any matches with a semicolon ';'
        result = re.sub(pattern, b';', data)
        print('Post-substitution: ', result)
    

    This prints:

    Pre-substitution:  b'this line ends with CRLF\r\nthis line ends with LF\nthis line ends with CRLF\r\nthis line ends with LF\nthis line ends with CRLF\r\n'
    Post-substitution:  b'this line ends with CRLF\r\nthis line ends with LF;this line ends with CRLF\r\nthis line ends with LF;this line ends with CRLF\r\n'
    

    It's worth mentioning that consecutive \ns will all be substituted, so \n\n\n becomes ;;; and \r\n\n becomes r\n;.

    Note also that the pattern string and substitution value are both bytestrings (b'<str>') - if you don't do this, you'll get a TypeError!