
Unable to remove line breaks in a text file in python


At the risk of losing reputation, I did not know what else to do. My file does not show any hidden characters, and I have tried every .replace and .strip I can think of. The file is UTF-8 encoded and I am using Python 3.6.1. I have a file with the format:

 >header1
 AAAAAAAA
 TTTTTTTT
 CCCCCCCC
 GGGGGGGG

 >header2
 CCCCCC
 TTTTTT
 GGGGGG
 AAAAAA

I am trying to remove the line breaks at the end of each line so that the sequence under each header becomes one continuous string. (The file is actually thousands of lines long.) My code is redundant in the sense that I typed in everything I could think of to remove line breaks:

 fref = open(ref)
 for line in fref:
     sequence = 0
     header = 0
     if line.startswith('>'):
          header = ''.join(line.splitlines())
          print(header)
     else:
          sequence = line.strip("\n").strip("\r")
          sequence = line.replace('\n', ' ').replace('\r', '').replace(' ', '').replace('\t', '')
          print(len(sequence))

The output is:

 >header1
 8
 8
 8
 8
 >header2
 6
 6
 6
 6

But if I manually go in and delete the line endings to join the lines, the result is reported as one contiguous string.

Expected output:

 >header1
 32
 >header2
 24

Thanks in advance for any help, Dennis


Solution

  • There are several approaches to parsing this kind of input. In all cases, I would recommend keeping the open and print side effects outside the parsing function, so that you can unit test the function and convince yourself that it behaves properly.
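
As a sketch of why the loop in the question prints 8 and 6: iterating over a file yields one line at a time, so stripping the newline only cleans that single line, and nothing joins it to its neighbours. The snippet below uses a hypothetical two-line in-memory sample (not the asker's file) to contrast per-line lengths with the length of an accumulated string.

```python
from io import StringIO

# Hypothetical stand-in for the real file: one header, two sequence lines.
sample = StringIO(">header1\nAAAAAAAA\nTTTTTTTT\n")

per_line = []  # what the question's loop effectively measures
joined = ""    # what the expected output needs

for line in sample:
    if line.startswith(">"):
        continue  # skip the header line
    stripped = line.strip()
    per_line.append(len(stripped))  # length of this one line only
    joined += stripped              # accumulate across lines

print(per_line)     # [8, 8]
print(len(joined))  # 16
```

The approaches that follow all rely on the same idea: keep a running total (or a joined string) per header instead of measuring each line on its own.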

    You could iterate over each line and handle the case of empty lines and end-of-file separately. Here, I use yield statements to return the values:

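    def parse(infile):
        line = ""  # keeps the final check safe when the file is empty
        for line in infile:
            if line.startswith(">"):
                # A header starts a new record: reset the running total.
                total = 0
                yield line.strip()
            elif not line.strip():
                # A blank line ends the current record.
                yield total
            else:
                total += len(line.strip())
        # No trailing blank line: emit the total of the last record.
        if line.strip():
            yield total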
    
    def test_parse():
        with open("input.txt") as infile:
            assert list(parse(infile)) == [
                ">header1",
                32,
                ">header2",
                24,
            ]
    

    Or, you could handle both empty lines and end-of-file at the same time. Here, I use an output array to which I append headers and totals:

    def parse(infile):
        output = []
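        while True:
            # readline() returns "" at end-of-file and "\n" for an empty line.
            line = infile.readline()
            if line.startswith(">"):
                total = 0
                header = line.strip()
            elif line and line.strip():
                total += len(line.strip())
            else:
                # Empty line or end-of-file: the current record is complete.
                output.append(header)
                output.append(total)
                if not line:
                    break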
    
        return output
    
    def test_parse():
        with open("input.txt") as infile:
            assert parse(infile) == [
                ">header1",
                32,
                ">header2",
                24,
            ]
    

    Or, you could also split the whole input file into empty-line-separated blocks and parse them independently. Here, I use an output stream to which I write the output; in production, you could pass the sys.stdout stream for example:

    import re
    def parse(infile, outfile):
        content = infile.read()
        for block in re.split(r"\r?\n\r?\n", content):
            header, *lines = re.split(r"\s+", block)
            total = sum(len(line) for line in lines)
            outfile.write("{header}\n{total}\n".format(
                header=header,
                total=total,
            ))
    
    from io import StringIO
    def test_parse():
        with open("input.txt") as infile:
            outfile = StringIO()
            parse(infile, outfile)
            outfile.seek(0)
            assert outfile.readlines() == [
                ">header1\n",
                "32\n",
                ">header2\n",
                "24\n",
            ]
    

    Note that my tests use open("input.txt") for brevity, but I would actually recommend passing a StringIO(...) instance instead: the input being tested is visible in the test itself, the filesystem is not touched, and the tests run faster.
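
A minimal sketch of such a StringIO-based test follows. The `parse` here is a compact variant of the generator approach written for this illustration, not the exact function from above: it is an assumption of this sketch that a total is emitted when the next header or end-of-file arrives, so blank lines are simply skipped.

```python
from io import StringIO


def parse(infile):
    """Yield each header, then the summed length of its sequence lines."""
    total = None  # None until the first header has been seen
    for line in infile:
        text = line.strip()
        if text.startswith(">"):
            if total is not None:
                yield total  # close the previous record
            yield text
            total = 0
        elif text and total is not None:
            total += len(text)  # blank lines fall through and are ignored
    if total is not None:
        yield total  # close the last record


def test_parse():
    # The input lives in the test itself, so no file on disk is needed.
    infile = StringIO(
        ">header1\nAAAAAAAA\nTTTTTTTT\nCCCCCCCC\nGGGGGGGG\n"
        "\n"
        ">header2\nCCCCCC\nTTTTTT\nGGGGGG\nAAAAAA\n"
    )
    assert list(parse(infile)) == [">header1", 32, ">header2", 24]


test_parse()
```

In production, the same function could be fed a real file object, for example `with open(ref) as fref: print(*parse(fref), sep="\n")`, which keeps the open and print side effects outside the tested code.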