I currently have a Python script which takes a file as a command-line argument, does what it needs to do, and then writes its output to a file named after the input with _all_ORF.fsa_aa appended. I'd like to actually edit the file name rather than append to it, but I am getting confused with variables. I'm not sure how to do it when the file name is held in a variable.
Here's an example of the command-line argument:
gL=genomeList.txt #Text file containing a list of genomes to loop through.
for i in $(cat ${gL}); do
#some other stuff ;
python ./find_all_ORF_from_getorf.py ${i}_getorf.fsa_aa ;
done
Here is some of the python script (find_all_ORF_from_getorf.py):
import re, sys
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
infile = sys.argv[1]
with open(f'{infile}_all_ORF.fsa_aa'.format(), "a") as file_object:
    for sequence in SeqIO.parse(infile, "fasta"):
        #do some stuff
        print(f'{sequence.description}_ORF_from_position_{h.start()},\n{sequence.seq[h_start:]}',
              file=file_object)
Currently, the output file is called Genome_file_getorf.fsa_aa_all_ORF.fsa_aa. I'd like to remove the first fsa_aa so that the output looks like this: Genome_file_getorf_all_ORF.fsa_aa. How do I do this? I can't work out how to edit it.
I have had a look at the os.rename function, but that doesn't seem to let me edit the name held in the variable, only append to it.
Thanks,
J
Regarding your bash code, you might find the following snippet useful. I find it a little more readable, and I use it a lot when iterating over lines.
while read -r line; do
    #some other stuff ;
    python ./find_all_ORF_from_getorf.py "${line}_getorf.fsa_aa" ;
done < genomeList.txt
Now, regarding your question and your Python code:
import re, sys
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
infile = sys.argv[1]
At this point infile will look like 'Genome_file_getorf.fsa_aa'. One option is to split this string on '.' and take the first item:
name = infile.split('.')[0]
If the file name might contain several '.' characters, as in 'Myfile.out.old', and you only want to drop the last extension, split once from the right instead:
name = infile.rsplit('.',1)[0]
A third option, if you know that all your files end with '.fsa_aa', is to slice the string using negative indices. As '.fsa_aa' has 7 characters:
name = infile[:-7]
All three options rely on Python's built-in string methods and slicing; see the official Python docs for more. Whichever you choose, you can then build the output file name from name:
outfile = f'{name}_all_ORF.fsa_aa'
# An f-string such as f'{variable}' is already formatted, so you don't need the ".format()"
# On the other hand you can do '{}'.format(variable)
# or even '{variable}'.format(variable=SomeOtherVariable)
with open(outfile, "a") as file_object:
    for sequence in SeqIO.parse(infile, "fasta"):
        #do some stuff
        # write() does not add a trailing newline the way print() does, so it is added explicitly
        file_object.write(f'{sequence.description}_ORF_from_position_{h.start()},\n{sequence.seq[h_start:]}\n')
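To see how the three string-based options behave on slightly different inputs, here is a minimal sketch. The sample file names and the helper strip_known_suffix are made up for illustration and are not part of the original script; the helper is just one way to guard the slicing approach so that it only removes the extension when it is really there.

```python
def strip_known_suffix(filename, suffix=".fsa_aa"):
    """Hypothetical helper: drop `suffix` only if `filename` actually ends with it."""
    if filename.endswith(suffix):
        return filename[:-len(suffix)]
    return filename


# Illustrative inputs (assumed, not taken from the real genome list)
samples = [
    "Genome_file_getorf.fsa_aa",         # the simple case
    "My.genome.v2_getorf.fsa_aa",        # extra dots inside the name
    "./data/Genome_file_getorf.fsa_aa",  # a relative path starting with "."
]

for s in samples:
    print("input           :", s)
    print("split('.')[0]   :", repr(s.split(".")[0]))      # stops at the FIRST dot
    print("rsplit('.',1)[0]:", repr(s.rsplit(".", 1)[0]))  # stops at the LAST dot
    print("[:-7]           :", repr(s[:-7]))               # blindly drops 7 characters
    print("guarded slice   :", repr(strip_known_suffix(s)))
    print()
```

Splitting on the first dot is the fragile one here: it truncates names that contain extra dots and returns an empty string for a path that starts with "./". Splitting from the right or removing the known suffix handles all three samples.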
Another option is to use Path from the pathlib library. I suggest that you play with this library a bit. In this case you would need a few other minor changes to the code:
import re, sys
from pathlib import Path # <- Here
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
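infile = Path(sys.argv[1]) # <- Here
outfile = infile.stem + '_all_ORF.fsa_aa' # <- Here
# infile.stem drops the directory part, so the line above writes to the current directory.
# If you want the output next to the input file, I would suggest instead
# outfile = infile.with_name(infile.stem + '_all_ORF.fsa_aa')
with open(outfile, "a") as file_object:
    for sequence in SeqIO.parse(infile, "fasta"):
        #do some stuff
        # write() does not add a trailing newline the way print() does, so it is added explicitly
        file_object.write(f'{sequence.description}_ORF_from_position_{h.start()},\n{sequence.seq[h_start:]}\n')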
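As a quick illustration of what pathlib gives you for a name like this, here is a small sketch. The path "results/Genome_file_getorf.fsa_aa" is an assumed example, and PurePosixPath is used only so the output is the same on every operating system; in the real script Path behaves the same way for these attributes. with_name is a standard pathlib method for replacing the final path component.

```python
from pathlib import PurePosixPath

# Assumed example input; the real script would use Path(sys.argv[1])
infile = PurePosixPath("results/Genome_file_getorf.fsa_aa")

print(infile.name)    # Genome_file_getorf.fsa_aa  (final component)
print(infile.stem)    # Genome_file_getorf         (name without the last suffix)
print(infile.suffix)  # .fsa_aa                    (the last suffix, dot included)
print(infile.parent)  # results                    (the directory part)

# Building the name from the stem alone loses the directory
stem_only = infile.stem + "_all_ORF.fsa_aa"

# with_name() swaps the final component and keeps the directory
same_dir = infile.with_name(infile.stem + "_all_ORF.fsa_aa")

print(stem_only)  # Genome_file_getorf_all_ORF.fsa_aa
print(same_dir)   # results/Genome_file_getorf_all_ORF.fsa_aa
```

Which form you want depends on whether the output should land in the current working directory or alongside the input file.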
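Finally, as you have seen, in both cases I have replaced the print call with the file_object.write method. Both work, but I find it clearer to write to a file than to print to it. Keep in mind that write does not append a newline the way print does, so add '\n' yourself wherever you need one.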
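The practical difference between the two calls is the trailing newline. A minimal sketch, using in-memory io.StringIO buffers in place of real files (an assumption made only to keep the example self-contained):

```python
import io

# print() appends a newline after each call by default
printed = io.StringIO()
print("record_1", file=printed)
print("record_2", file=printed)

# write() emits exactly the string it is given, nothing more
written = io.StringIO()
written.write("record_1")
written.write("record_2")

# write() with an explicit newline matches what print() produced
written_nl = io.StringIO()
written_nl.write("record_1\n")
written_nl.write("record_2\n")

print(repr(printed.getvalue()))     # 'record_1\nrecord_2\n'
print(repr(written.getvalue()))     # 'record_1record_2'
print(repr(written_nl.getvalue()))  # 'record_1\nrecord_2\n'
```

So when switching from print to write in a loop, each record needs its own '\n' or consecutive records end up on the same line.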