Using variable as part of name of new file in python


I'm fairly new to Python and I'm having an issue with my script (split_fasta.py). Here is an example of my issue:

list = ["1.fasta", "2.fasta", "3.fasta"]
for file in list:
    contents = open(file, "r")
    for line in contents:
        if line[0] == ">":
            new_file = open(file + "_chromosome.fasta", "w")
            new_file.write(line)

I've left the bottom part of the program out because it's not needed. My issue is that when I run this program in the same directory as my fasta123 files, it works great:

python split_fasta.py *.fasta

But if I'm in a different directory and I want the program to output the new files (e.g. 1.fasta_chromosome.fasta) to my current directory, it doesn't:

python /home/bin/split_fasta.py /home/data/*.fasta

This still creates the new files in the same directory as the fasta files. The issue here I'm sure is with this line:

new_file = open(file + "_chromosome.fasta", "w")

Because if I change it to this:

new_file = open("seq" + "_chromosome.fasta", "w")

It creates an output file in my current directory.

I hope this makes sense to some of you and that I can get some suggestions.


Solution

  • You are passing the full path of the old file and appending a new suffix to it. So if file == "/home/data/something.fasta", the output file will be file + "_chromosome.fasta", which is /home/data/something.fasta_chromosome.fasta, and it lands next to the input file.

    If you use os.path.basename on file, you get just the name of the file (in my example, something.fasta). Opening that relative name writes to your current directory.

    From @Adam Smith

    You can use os.path.splitext to get rid of the .fasta

    basename, _ = os.path.splitext(os.path.basename(file))
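
To make the two calls concrete, here is a minimal sketch that combines them. The helper name chromosome_name is hypothetical (it is not part of the original script), and the path is the one from the example above.

```python
import os


def chromosome_name(path):
    """Build an output file name without the directory or the .fasta extension.

    Hypothetical helper for illustration; the name is not from the original script.
    """
    # basename drops the directory part: "/home/data/something.fasta" -> "something.fasta"
    # splitext splits off the extension: "something.fasta" -> ("something", ".fasta")
    base, _ext = os.path.splitext(os.path.basename(path))
    return base + "_chromosome.fasta"


print(chromosome_name("/home/data/something.fasta"))  # something_chromosome.fasta
```

Because the result has no directory component, open() resolves it against the current working directory.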
    

    Getting back to the code example, I saw several things that are not recommended in Python. I'll go into detail.

    Avoid shadowing builtin names such as list, str, int... It hides the builtin for the rest of the scope and can lead to confusing errors later.
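
A short illustration of what can go wrong, as a sketch only: once the name list is rebound to a list object, calling list(...) in the same scope fails. The variable error exists just to capture the message.

```python
# Rebinding the name: from here on, `list` is this object, not the builtin type.
list = ["1.fasta", "2.fasta", "3.fasta"]

try:
    list("abc")  # would normally build ['a', 'b', 'c']
except TypeError as exc:
    error = str(exc)  # the list object is not callable

print(error)

# Removing the module-level name makes the builtin visible again.
del list
print(list("abc"))
```

Using a different name, such as files, avoids the problem entirely.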

    When opening a file for reading or writing, you should use the with syntax. This is highly recommended because it takes care of closing the file, even if an exception is raised inside the block.

    with open(filename, "r") as f:
        data = f.read()
    with open(new_filename, "w") as f:
        f.write(data)
    

    If line is ever an empty string (for example after stripping the trailing newline from a blank line), line[0] == ... will raise an IndexError exception. Use line.startswith(...) instead, which simply returns False.
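
The difference between the two checks can be sketched as follows. The function names are hypothetical, and the empty string stands in for a line that has been stripped or otherwise emptied (an assumption, since lines read straight from a file keep their newline).

```python
def is_header_by_index(line):
    """Index-based check: fails on an empty string."""
    return line[0] == ">"


def is_header(line):
    """startswith-based check: safe for any string, including ""."""
    return line.startswith(">")


print(is_header(">chr1 some description\n"))  # True
print(is_header(""))                          # False

try:
    is_header_by_index("")
except IndexError:
    print("line[0] raised IndexError on an empty string")
```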

    Final code:

    import os

    files = ["1.fasta", "2.fasta", "3.fasta"]
    for file in files:
        with open(file, "r") as infile:
            for line in infile:
                if line.startswith(">"):
                    # splitext returns (root, extension); keep only the root
                    base, _ = os.path.splitext(os.path.basename(file))
                    new_name = base + "_chromosome.fasta"
                    with open(new_name, "w") as outfile:
                        outfile.write(line)
    

    Often, people tell me "that's ugly". Not really :). The levels of indentation make it clear which context each statement belongs to.
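
Since the question runs the script as python /home/bin/split_fasta.py /home/data/*.fasta, here is a hedged sketch that takes the paths from the command line instead of a hard-coded list. The function name write_headers and its out_dir parameter are assumptions for illustration. Because the rest of the original script isn't shown, this version simplifies things: it opens one output file per input file and copies only the header lines into it.

```python
import os
import sys


def write_headers(paths, out_dir="."):
    """Write the header lines of each FASTA file to <name>_chromosome.fasta in out_dir.

    Hypothetical helper; returns the list of output paths it created.
    """
    written = []
    for path in paths:
        # Strip the input directory and the extension so the output
        # goes to out_dir (the current directory by default).
        base, _ext = os.path.splitext(os.path.basename(path))
        new_name = os.path.join(out_dir, base + "_chromosome.fasta")
        with open(path, "r") as infile, open(new_name, "w") as outfile:
            for line in infile:
                if line.startswith(">"):
                    outfile.write(line)
        written.append(new_name)
    return written


if __name__ == "__main__":
    # The shell expands /home/data/*.fasta into separate arguments.
    write_headers(sys.argv[1:])
```

Run from any directory, this creates the new files in that directory, whatever directory the input files live in.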