I have a long FASTA file and I need to reformat the lines. I have tried many things, but since I'm not very familiar with Python, I couldn't quite solve it.
>seq1
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>seq2
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
I want them to look like:
>seq1
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>seq2
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
I've tried this:
a_file = open("file.fasta", "r")
string_without_line_breaks = ""
for line in a_file:
    if line[0:1] == ">":
        continue
    else:
        stripped_line = line.rstrip()
        string_without_line_breaks += stripped_line
a_file.close()
print(string_without_line_breaks)
But the result does not show the ">" lines, and it also merges all the other lines into one. I hope you can help me with this. Thank you.
A common arrangement is to remove the newline, and then add it back when you see the next record.
# Use a context manager (with statement)
with open("file.fasta", "r") as a_file:
    # Keep track of whether we have written something without a newline
    written_lines = False
    for line in a_file:
        # Use standard .startswith()
        if line.startswith(">"):
            if written_lines:
                print()
                written_lines = False
            print(line, end='')
        else:
            print(line.rstrip('\n'), end='')
            written_lines = True
    if written_lines:
        print()
A common beginner bug is forgetting to add the final newline after falling off the end of the loop.
This simply prints each line as it is read and doesn't return anything. A better design would probably be to collect and yield one FASTA record (header + sequence) at a time, perhaps as an object, and have the caller decide what to do with it. But at that point you probably want to use an existing library which does that; Biopython seems to be the go-to solution for bioinformatics.
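If you would rather not pull in a library, the "yield one record at a time" design could be sketched roughly like this. The names read_fasta and unwrap_fasta are made up for illustration, records are returned as plain (header, sequence) tuples rather than objects, and the sketch assumes every record starts with a ">" line (anything before the first header is ignored).

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs, joining wrapped sequence lines.

    Hypothetical helper: takes any iterable of lines, e.g. an open file.
    """
    header = None
    chunks = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header ends the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header = line
            chunks = []
        elif line and header is not None:
            # Sequence line belonging to the current record
            chunks.append(line)
    # Don't forget the last record after falling off the end of the loop
    if header is not None:
        yield header, "".join(chunks)


def unwrap_fasta(lines: Iterable[str]) -> str:
    """Return the FASTA text with each sequence on a single line."""
    return "".join(
        f"{header}\n{sequence}\n" for header, sequence in read_fasta(lines)
    )


# Possible usage, assuming the input is in file.fasta:
#
#     with open("file.fasta") as a_file:
#         print(unwrap_fasta(a_file), end="")
```

The caller now decides what to do with each record: print it, write it to another file, or filter by header. For very large files you would probably loop over read_fasta() and write each record as it arrives, instead of building one big string with unwrap_fasta().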