Remove multiple EOL in file


I have a tab delimited file with \n EOL characters that looks something like this:

User Name\tCode\tTrack\tColor\tNote\n\nUser Name2\tCode2\tTrack2\tColor2\tNote2\n

I am taking this input file and reformatting it into a nested list using split('\t'). The list should look like this:

[['User Name','Code','Track','Color','Note'],
 ['User Name2','Code2','Track2','Color2','Note2']]

The software that generates the file allows the user to press the "Enter" key any number of times while filling out the "Note" field. It also allows the user to press "Enter" any number of times, creating newlines without entering any visible text in the "Note" field at all.

Lastly, the user may press "Enter" any number of times in the middle of the "Note", creating multiple paragraphs. This would be such a rare occurrence operationally that I am willing to leave it unaddressed if it complicates the code much. This possibility is really, really low priority.

As seen in the sample above, these actions can result in a sequence of "\n\n..." codes of any length preceding, trailing or replacing the "Note" field. To put it another way, the following replacements are required before I can place the file contents into a list:

\t\n\n... preceding "Note" must become \t
\n\n... trailing "Note" must become \n
\n\n... in place of "Note" must become \n
\n\n... in the middle of the "Note" text must become a single whitespace, if easy to do

I have tried using the strip() and replace() methods without success. Does the file object need to be copied into something else first before the replace() method can be used on it?

I have experience with Awk, but I am hoping Regular Expressions are not needed for this as I am very new to Python. This is the code that I need to improve in order to address multiple newlines:

marker = [i.strip() for i in open('SomeFile.txt', 'r')]

marker_array = []
for i in marker:
    marker_array.append(i.split('\t'))

for i in marker_array:
    print i

Solution

  • Count the tabs. If you can assume that no single line of a note contains 4 tabs, you can keep appending lines to the note until you reach a line that does contain at least 4 tabs, which marks the start of the next record:

    def collapse_newlines(s):
        # Collapse multiple consecutive newlines into one; removes trailing newlines
        return '\n'.join(filter(None, s.split('\n')))
    
    def read_tabbed_file(filename):
        with open(filename) as f:
            row = None
            for line in f:
                if line.count('\t') < 4:   # Note continuation
                    if row is not None:    # skip stray lines before the first record
                        row[-1] += line
                    continue
    
                if row is not None:
                    row[-1] = collapse_newlines(row[-1])
                    yield row
    
                row = line.split('\t')
    
            if row is not None:
                row[-1] = collapse_newlines(row[-1])
                yield row
    

    The above generator function will not yield a row until it is certain that there is no note continuing on the next line, effectively looking ahead.

    Now use the read_tabbed_file() function as a generator and loop over the results:

    marker_array = []
    for row in read_tabbed_file(yourfilename):
        # row is a list of elements
        marker_array.append(row)
    

    Demo:

    >>> open('/tmp/test.csv', 'w').write('User Name\tCode\tTrack\tColor\tNote\n\nUser Name2\tCode2\tTrack2\tColor2\tNote2\n')
    >>> for row in read_tabbed_file('/tmp/test.csv'):
    ...     print row
    ... 
    ['User Name', 'Code', 'Track', 'Color', 'Note']
    ['User Name2', 'Code2', 'Track2', 'Color2', 'Note2']
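
The question also asks, as a low priority, that newlines in the middle of a note become a single space. The answer's collapse_newlines() keeps them as single newlines, so here is a rough sketch of one way to get spaces instead. The names collapse_to_spaces and read_tabbed_lines and the fields parameter are made up for this illustration and are not part of the original answer. The reader takes any iterable of lines rather than a filename, purely so it can be tried on an in-memory sample.

```python
import io

def collapse_to_spaces(s):
    # Hypothetical variant of collapse_newlines(): drop empty lines and
    # join the remaining pieces of the note with single spaces.
    return ' '.join(part.strip() for part in s.split('\n') if part.strip())

def read_tabbed_lines(lines, fields=5):
    # Same look-ahead idea as read_tabbed_file(), but over any iterable
    # of lines. Assumes a record start always has fields - 1 tabs.
    row = None
    for line in lines:
        if line.count('\t') < fields - 1:   # note continuation or blank line
            if row is not None:             # ignore lines before the first record
                row[-1] += line
            continue
        if row is not None:
            row[-1] = collapse_to_spaces(row[-1])
            yield row
        row = line.split('\t')
    if row is not None:
        row[-1] = collapse_to_spaces(row[-1])
        yield row

# In-memory sample: a two-paragraph note, then a record whose note is only newlines.
sample = io.StringIO(
    u'User Name\tCode\tTrack\tColor\tFirst paragraph\n\n\nSecond paragraph\n\n'
    u'User Name2\tCode2\tTrack2\tColor2\t\n\n'
)
rows = list(read_tabbed_lines(sample))
for row in rows:
    print(row)
```

Wrapping the generator in list() gives the nested list the question asks for. With this variant, a note that consists only of newlines ends up as an empty string in the last column.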