
How do I change part of a file name when it is a variable in python?


I currently have a python script which takes a file as a command-line argument, does what it needs to do, and then outputs that file with _all_ORF.fsa_aa appended. I'd like to actually edit the file name rather than appending, but I am getting confused with variables. I'm not sure how I can actually do it when the file is a variable.

Here's an example of the command-line argument:

gL=genomeList.txt   #Text file containing a list of genomes to loop through.             

for i in $(cat ${gL}); do
    #some other stuff ; 
    python ./find_all_ORF_from_getorf.py ${i}_getorf.fsa_aa ; 
    done

Here is some of the python script (find_all_ORF_from_getorf.py):

import re, sys

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

infile = sys.argv[1]

with open(f'{infile}_all_ORF.fsa_aa'.format(), "a") as file_object:
    for sequence in SeqIO.parse(infile, "fasta"):
       #do some stuff
       print(f'{sequence.description}_ORF_from_position_{h.start()},\n{sequence.seq[h_start:]}', 
       file=file_object)

Currently, the output file is called Genome_file_getorf.fsa_aa_all_ORF.fsa_aa. I'd like to remove the first fsa_aa so that the output looks like this: Genome_file_getorf_all_ORF.fsa_aa. How do I do this? I can't work out how to edit it.

I have had a look at the os.rename function, but that doesn't seem to let me edit the name held in the variable, just append to it.

Thanks,

J


Solution

  • Regarding your bash code, you might find the following snippet useful. I find it a little more readable, and I use it a lot when iterating over lines.

    # -r stops read from interpreting backslashes; the quotes keep names with spaces intact
    while read -r line; do
        #some other stuff ; 
        python ./find_all_ORF_from_getorf.py "${line}_getorf.fsa_aa"
    done < genomeList.txt
    

    Now, regarding your question and your Python code:

    import re, sys 
    
    from Bio import SeqIO
    from Bio.Seq import Seq
    from Bio.SeqRecord import SeqRecord
    
    infile = sys.argv[1]
    

    At this point your infile will look like 'Genome_file_getorf.fsa_aa'. One option is to split this string on '.' and take the first item:

    name = infile.split('.')[0]
    

    If the file name might contain several '.' characters, like 'Myfile.out.old', and you only want to drop the last extension, split once from the right instead:

    name = infile.rsplit('.',1)[0]
    

    A third option: if you know that all your files end with '.fsa_aa', you can just slice the string using a negative index. As '.fsa_aa' has 7 characters:

    name = infile[:-7] 
    

    These three options all rely on Python's built-in string handling (str.split, str.rsplit and slicing); see the official Python docs for more. With name in hand, build the output file name:

    outfile = f'{name}_all_ORF.fsa_aa' 
    # if you wrote f'{variable}' you don't need the ".format()"
    # On the other hand you can do '{}'.format(variable)
    # or even '{variable}'.format(variable=SomeOtherVariable)
    
    with open(outfile, "a") as file_object:
        for sequence in SeqIO.parse(infile, "fasta"):
           #do some stuff
           # write() does not add a trailing newline, so include it explicitly
           file_object.write(f'{sequence.description}_ORF_from_position_{h.start()},\n{sequence.seq[h_start:]}\n')
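
To compare the three string-based options, here is a minimal sketch. The helper names (strip_by_split, strip_by_rsplit, strip_by_slice) and the example file names are made up for illustration and are not part of the original script. It assumes the script may also be handed names with extra dots or a leading './', which is where the options start to differ.

```python
def strip_by_split(filename):
    # Keeps everything before the FIRST dot.
    return filename.split('.')[0]


def strip_by_rsplit(filename):
    # Keeps everything before the LAST dot.
    return filename.rsplit('.', 1)[0]


def strip_by_slice(filename, suffix='.fsa_aa'):
    # Removes the suffix only if the name really ends with it.
    if filename.endswith(suffix):
        return filename[:-len(suffix)]
    return filename


examples = (
    'Genome_file_getorf.fsa_aa',         # the simple case from the question
    'Genome.v2_getorf.fsa_aa',           # hypothetical name with an extra dot
    './data/Genome_file_getorf.fsa_aa',  # hypothetical relative path
)

for example in examples:
    print(repr(example))
    print('  split :', repr(strip_by_split(example)))
    print('  rsplit:', repr(strip_by_rsplit(example)))
    print('  slice :', repr(strip_by_slice(example)))
```

For the plain name all three give 'Genome_file_getorf'. With the extra dot, split('.')[0] returns only 'Genome', and with a leading './' it returns an empty string, so rsplit or the guarded slice are probably the safer choices if your names can look like that.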
    

    Another option is to use Path from the pathlib library; I do suggest that you play a bit with this library. In this case you would have to make a few other minor changes to the code:

    import re, sys
    from pathlib import Path # <- Here
    
    from Bio import SeqIO
    from Bio.Seq import Seq
    from Bio.SeqRecord import SeqRecord
    
    infile = Path(sys.argv[1]) # <- Here
    outfile = infile.stem + '_all_ORF.fsa_aa' # <- Here 
    # Note that infile.stem drops any directory part of infile. If you want
    # outfile to be a path next to the input file, I would suggest instead
    # outfile = infile.parent.joinpath(infile.stem + '_all_ORF.fsa_aa')
    
    with open(outfile, "a") as file_object:
        for sequence in SeqIO.parse(infile, "fasta"):
           #do some stuff
           # write() does not add a trailing newline, so include it explicitly
           file_object.write(f'{sequence.description}_ORF_from_position_{h.start()},\n{sequence.seq[h_start:]}\n')
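
As a small, self-contained sketch of the pathlib approach: the function name output_path and the 'data/' directory below are assumptions for illustration only. It uses Path.with_name, which swaps the final path component while keeping the parent directory, so the output lands next to the input file.

```python
from pathlib import Path


def output_path(infile, suffix='_all_ORF.fsa_aa'):
    """Return the output path for infile, in the same directory as infile."""
    infile = Path(infile)
    # stem is the file name without its last extension ('.fsa_aa' here);
    # with_name replaces only the file name and keeps the parent directory.
    return infile.with_name(infile.stem + suffix)


print(output_path('Genome_file_getorf.fsa_aa'))
print(output_path('data/Genome_file_getorf.fsa_aa'))
```

The returned Path object can be passed straight to open(), just like a string.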
    

    Finally, as you have seen, in both cases I replaced the print call with the file_object.write method; I find it clearer to write to a file than to print to it. Note that write, unlike print, does not append a newline, so the trailing '\n' has to be added explicitly.
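
To illustrate the difference between the two calls, here is a minimal sketch that uses in-memory io.StringIO buffers in place of real files; the record names are placeholders, not real sequence data.

```python
import io

buffer_print = io.StringIO()
buffer_write = io.StringIO()

for record in ('seq1', 'seq2'):
    print(record, file=buffer_print)  # print appends '\n' after each call
    buffer_write.write(record)        # write emits exactly the string it is given

print(repr(buffer_print.getvalue()))  # 'seq1\nseq2\n'
print(repr(buffer_write.getvalue()))  # 'seq1seq2'
```

Without an explicit '\n', consecutive write calls run together on one line, which would corrupt a FASTA-style output file.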