Tags: python, python-3.x, fasta

Remove specific line breaks in Python


I have a long FASTA file and I need to reformat its lines. I have tried many things, but since I'm not very familiar with Python I couldn't solve it.

>seq1
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>seq2
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

I want them to look like:

>seq1
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>seq2
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

I've tried this:

a_file = open("file.fasta", "r")
string_without_line_breaks = ""
for line in a_file:
    if line[0:1] == ">":
        continue
    else:
        stripped_line = line.rstrip()
        string_without_line_breaks += stripped_line
a_file.close()
print(string_without_line_breaks)

But the result does not show the ">" lines, and it also merges all the other lines into one. I hope you can help me with this. Thank you.


Solution

  • A common arrangement is to remove the newline from each sequence line, and then add one back when you see the next header.

    # Use a context manager (with statement)
    with open("file.fasta", "r") as a_file:
        # Keep track of whether we have written something without a newline
        written_lines = False
        for line in a_file:
            # Use standard .startswith()
            if line.startswith(">"):
                if written_lines:
                    print()
                    written_lines = False
                print(line, end='')
            else:
                print(line.rstrip('\n'), end='')
                written_lines = True
        if written_lines:
            print()
    

    A common beginner bug is forgetting to add the final newline after falling off the end of the loop; the final if written_lines: check after the loop handles that case.

    This simply prints one line at a time and doesn't return anything. A better design would probably be to collect and yield one FASTA record (header + sequence) at a time, perhaps as an object, and have the caller decide what to do with it. At that point, though, you probably want to use an existing library that already does this; Biopython seems to be the go-to solution for bioinformatics.
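
To illustrate the "collect and yield one record at a time" idea, here is a minimal sketch in plain Python. The function name read_fasta and the (header, sequence) tuple shape are my own choices for illustration, not part of any library, and the sketch assumes a simple FASTA file in which every record starts with a ">" line. Blank lines and anything before the first header are silently skipped.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs, joining wrapped sequence lines.

    Hypothetical helper for illustration; not a library function.
    """
    header = None   # header line of the record being collected
    chunks = []     # sequence lines seen so far for that record
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line
            chunks = []
        elif line and header is not None:
            # Skip blank lines and any text before the first header.
            chunks.append(line)
    # Emit the last record after falling off the end of the loop.
    if header is not None:
        yield header, "".join(chunks)


# Small in-memory sample so the example runs without a file.
sample = """>seq1
AAAA
CCCC
GG
>seq2
TTTT
AAAA
""".splitlines(keepends=True)

for header, sequence in read_fasta(sample):
    print(header)
    print(sequence)
```

For a real file, you would pass the open file handle instead of the sample list, since iterating over a file yields its lines. The caller then decides whether to print, write to a new file, or filter records.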