Using snakemake to rename files according to defined mapping


I'm trying to use snakemake to download a list of files and then rename them according to a mapping given in a file. I first read a dictionary of the form {ID_for_download : sample_name} from a file and pass the list of its keys to the first rule for download (because downloading is taxing, I'm just using a dummy script to generate empty files). For every ID in the list, two files are downloaded, named {file}_1.fastq and {file}_2.fastq. Once those files are downloaded, I rename them in the second rule; here I take advantage of being able to run Python code in a rule using the run keyword. When I do a dry run with the -n flag, everything works. But when I do an actual run, I get an error of the form

Job Missing files after 5 seconds [list of files]
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Job id: 0 completed successfully, but some output files are missing. 0
Exiting because a job execution failed. Look above for error message
Removing output files of failed job rename_srafiles_to_samples since they might be corrupted: [list of all files]

What happens is that a directory to store my files is created, then my files are "downloaded" and renamed. When it reaches the last file, I get this error and everything is deleted. The Snakefile is below:

import csv
import os
SRA_MAPPING = read_dictionary() #dictionary read from a file
SRAFILES = list(SRA_MAPPING.keys())[1:] #list of sra files
SAMPLES = [SRA_MAPPING[key] for key in SRAFILES] #list of sample names
rule all:
    input:
        expand("raw_samples/{samples}_1.fastq",samples=SAMPLES),
        expand("raw_samples/{samples}_2.fastq",samples=SAMPLES),
rule download_srafiles:
    output:
        expand("raw_samples/{srafiles}_1.fastq",srafiles=SRAFILES),
        expand("raw_samples/{srafiles}_2.fastq",srafiles=SRAFILES)
    shell:
        "bash dummy_download.sh"
rule rename_srafiles_to_samples:
    input:
        expand("raw_samples/{srafiles}_1.fastq",srafiles=SRAFILES),
        expand("raw_samples/{srafiles}_2.fastq",srafiles=SRAFILES)
    output:
        expand("raw_samples/{samples}_1.fastq",samples=SAMPLES),
        expand("raw_samples/{samples}_2.fastq",samples=SAMPLES)
    run:
        os.chdir(os.getcwd()+r"/raw_samples")
        for file in os.listdir():
            old_name=file[:file.find("_")]
            sample_name=SRA_MAPPING[old_name]
            new_name=file.replace(old_name,sample_name)
            os.rename(file,new_name)

I've tried running download_srafiles on its own and it worked. I also tried running rename_srafiles_to_samples on its own and it worked. But when I run the two rules together, I get the error. For completeness, the script dummy_download.sh is below:

#!/bin/bash
# Read the IDs from the first column of linker.csv, skipping the header row.
mapfile -t samples < <(cut -d , -f 1 linker.csv | tail -n +2)
for file in "${samples[@]}"
do
    touch "raw_samples/${file}_1.fastq"
    touch "raw_samples/${file}_2.fastq"
done

(linker.csv is a file with ID_for_download in one column and sample_name in the other.)
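The Snakefile above calls read_dictionary() without showing it. A minimal sketch of what such a helper could look like, assuming linker.csv has a header row followed by ID,sample_name rows; the function signature and the column order are assumptions, not the original poster's code:

```python
import csv


def read_dictionary(path="linker.csv"):
    """Return {ID_for_download: sample_name} from a two-column CSV.

    Assumes the first row is a header and skips it, so callers do not
    need to drop the first key afterwards.
    """
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        # Ignore blank or incomplete rows instead of failing on them.
        return {row[0]: row[1] for row in reader if len(row) >= 2}
```

If the header is skipped in the helper like this, the [1:] slice on the keys in the Snakefile would presumably no longer be needed.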

What am I doing wrong?

EDIT: Per user dariober, changing directories via Python's os module in the rule rename_srafiles_to_samples "confused" snakemake. Snakemake's logic is sound: once I change into raw_samples, it looks for raw_samples/... relative to the new working directory, i.e. inside raw_samples itself, and fails. To that end, I tested different versions.
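The effect can be reproduced in plain Python, without snakemake. This is only a sketch of how relative paths are resolved in a single process; it assumes the run block and the output check share one working directory, which is what the observed behaviour suggests. The function name is made up for illustration:

```python
import os
import tempfile


def output_visible_after_chdir(restore):
    """Create raw_samples/A_1.fastq, chdir into raw_samples, then check
    the relative path the way a caller in the original directory would.
    """
    start = os.getcwd()
    with tempfile.TemporaryDirectory() as workdir:
        try:
            os.chdir(workdir)
            os.mkdir("raw_samples")
            open(os.path.join("raw_samples", "A_1.fastq"), "w").close()

            os.chdir("raw_samples")  # what the run block did
            if restore:
                os.chdir("..")  # what Version 2 below adds

            # A relative path is resolved against the current directory.
            return os.path.exists(os.path.join("raw_samples", "A_1.fastq"))
        finally:
            os.chdir(start)  # leave the process where it started
```

Without the restore step the check looks for raw_samples/raw_samples/A_1.fastq and returns False, even though the file exists.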

Version 1

Exactly as dariober explained. Important bits of code:

for file in os.listdir('raw_samples'):
     old_name= file[:file.find("_")]
     sample_name=SRA_MAPPING[old_name]
     new_name= file.replace(old_name,sample_name)
     os.rename('raw_samples/' + file, 'raw_samples/' + new_name)

It lists the files in the "raw_samples" directory and then renames them. The crucial thing is to add the directory prefix (raw_samples/) to both paths in each rename.
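The same idea as a standalone function, as a sketch only. rename_by_mapping is a hypothetical helper, and unlike the loop above it skips files whose prefix is not in the mapping (so a second run does not raise KeyError on already renamed files) and replaces only the leading prefix rather than every occurrence:

```python
import os


def rename_by_mapping(directory, mapping):
    """Rename <old>_<rest> to <new>_<rest> inside directory.

    Paths are always joined with directory, so the current working
    directory is never changed. Files whose prefix is not a key of
    mapping are left untouched.
    """
    renamed = []
    for name in sorted(os.listdir(directory)):
        old_prefix, sep, rest = name.partition("_")
        if not sep or old_prefix not in mapping:
            continue
        new_name = mapping[old_prefix] + sep + rest
        os.rename(os.path.join(directory, name),
                  os.path.join(directory, new_name))
        renamed.append(new_name)
    return renamed
```

Inside the run block this could be called as rename_by_mapping("raw_samples", SRA_MAPPING).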

Version 2

The same as my original post, but instead of staying in raw_samples, I change back to the original working directory after the loop. It works.

os.chdir(os.getcwd()+r"/raw_samples")
for file in os.listdir():
     old_name= file[:file.find("_")]
     sample_name=SRA_MAPPING[old_name]
     new_name= file.replace(old_name,sample_name)
     os.rename(file,new_name)
os.chdir("..")
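One caveat with Version 2: if anything inside the loop raises (for example a KeyError for a file that is not in the mapping), os.chdir("..") is never reached. A small context manager can guarantee the way back. working_directory is a hypothetical helper name; on Python 3.11 and later, contextlib.chdir from the standard library does the same job:

```python
import os
from contextlib import contextmanager


@contextmanager
def working_directory(path):
    """Temporarily change the working directory, restoring it on exit."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(previous)  # runs even if the body raises
```

The loop would then sit under "with working_directory('raw_samples'):" and the explicit os.chdir("..") could be dropped.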

Version 3

Same as my original post, but instead of modifying anything in the run segment, I modify the output to drop the directory prefix. This means that I have to modify my rule all too. It didn't work. Code is below:

rule all:
    input:
        expand("{samples}_1.fastq",samples=SAMPLES),
        expand("{samples}_2.fastq",samples=SAMPLES),

rule download_srafiles:
    output:
        expand("raw_samples/{srafiles}_1.fastq",srafiles=SRAFILES),
        expand("raw_samples/{srafiles}_2.fastq",srafiles=SRAFILES)
    shell:
        "touch {output}"

rule rename_srafiles_to_samples:
    input:
        expand("raw_samples/{srafiles}_1.fastq",srafiles=SRAFILES),
        expand("raw_samples/{srafiles}_2.fastq",srafiles=SRAFILES)
    output:
        expand("{samples}_1.fastq",samples=SAMPLES),
        expand("{samples}_2.fastq",samples=SAMPLES)
    run:
        os.chdir(os.getcwd()+r"/raw_samples")
        for file in os.listdir():
             old_name= file[:file.find("_")]
             sample_name=SRA_MAPPING[old_name]
             new_name= file.replace(old_name,sample_name)
             os.rename(file,new_name)

The error it gives is:

MissingOutputException in line 24
...
Job files missing

The files are actually there, so I don't know whether I made an error in the code or this is a bug.
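One likely reading of the Version 3 failure: the loop still renames the files in place inside raw_samples, while the rule now declares outputs without that prefix, so the renamed files exist but not at the declared paths. A small sketch of that check, using a hypothetical helper missing_outputs (not a snakemake function):

```python
import os


def missing_outputs(declared, base_dir="."):
    """Return the declared output paths that do not exist under base_dir."""
    return [path for path in declared
            if not os.path.exists(os.path.join(base_dir, path))]
```

With a file at raw_samples/A_1.fastq, the declared output "A_1.fastq" is reported as missing, while "raw_samples/A_1.fastq" is not.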

Conclusion

I wouldn't say that this is a problem with snakemake; it's more a problem with my poorly thought-out process. In retrospect, it makes perfect sense that changing into a directory breaks snakemake's handling of input and output paths. If I want to use the os module to change directories inside a rule, I have to be very careful: enter wherever I need to, but always return to the original working directory. Many thanks to /u/dariober and /u/SultanOrazbayev.


Solution

  • I think snakemake gets confused by os.chdir. Your rule rename_srafiles_to_samples creates the correct files, and the input/output naming is fine. However, since you have changed directory, snakemake cannot find the expected output. I'm not sure I'm correct in all this, nor, if I am, whether it is a bug. This version avoids os.chdir and seems to work:

    import csv
    import os
    
    SRA_MAPPING = {'SRR1': 'A', 'SRR2': 'B'}
    SRAFILES = list(SRA_MAPPING.keys()) #list of sra files
    SAMPLES = [SRA_MAPPING[key] for key in SRAFILES] #list of sample names
    
    rule all:
        input:
            expand("raw_samples/{samples}_1.fastq",samples=SAMPLES),
            expand("raw_samples/{samples}_2.fastq",samples=SAMPLES),
    
    rule download_srafiles:
        output:
            expand("raw_samples/{srafiles}_1.fastq",srafiles=SRAFILES),
            expand("raw_samples/{srafiles}_2.fastq",srafiles=SRAFILES)
        shell:
            "touch {output}"
    
    rule rename_srafiles_to_samples:
        input:
            expand("raw_samples/{srafiles}_1.fastq",srafiles=SRAFILES),
            expand("raw_samples/{srafiles}_2.fastq",srafiles=SRAFILES)
        output:
            expand("raw_samples/{samples}_1.fastq",samples=SAMPLES),
            expand("raw_samples/{samples}_2.fastq",samples=SAMPLES)
        run:
            # os.chdir(os.getcwd()+r"/raw_samples")
    
            for file in os.listdir('raw_samples'):
                 old_name= file[:file.find("_")]
                 sample_name=SRA_MAPPING[old_name]
                 new_name= file.replace(old_name,sample_name)
                 os.rename('raw_samples/' + file, 'raw_samples/' + new_name)
    

    (However, a more snakemake-ish solution would be to have a wildcard for the SRR id and have each rule executed once for each SRR id, basically avoiding expand in download_srafiles and rename_srafiles_to_samples)
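A rough sketch of the lookup such a per-file rule would need, assuming the same toy mapping as above. sra_fastq, the {sample} and {read} wildcard names, and the "renamed/" output directory are all assumptions made for illustration; the rule itself is only shown as a comment and is untested:

```python
from types import SimpleNamespace

SRA_MAPPING = {"SRR1": "A", "SRR2": "B"}  # same toy mapping as above

# Invert the mapping so a sample name can be traced back to its SRR id.
SAMPLE_TO_SRA = {sample: sra for sra, sample in SRA_MAPPING.items()}


def sra_fastq(wildcards):
    """Input function: path of the downloaded file for one sample/read."""
    sra = SAMPLE_TO_SRA[wildcards.sample]
    return f"raw_samples/{sra}_{wildcards.read}.fastq"


# In a Snakefile this could be used roughly as follows (untested sketch;
# the "renamed/" output directory is an assumption that keeps the
# download and rename patterns from matching each other's files):
#
# rule rename_srafile_to_sample:
#     input:
#         sra_fastq
#     output:
#         "renamed/{sample}_{read}.fastq"
#     shell:
#         "mv {input} {output}"

# Stand-in for the wildcards object snakemake would pass in.
example = SimpleNamespace(sample="B", read="2")
print(sra_fastq(example))  # prints raw_samples/SRR2_2.fastq
```

Each job then handles a single file, so one failed rename no longer causes every output of the rule to be removed.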