I'm trying to use snakemake to download a list of files and then rename them according to a mapping given in a file. I first read a dictionary of the form {ID_for_download : sample_name} from a file, and I pass the list of its keys to the first rule for download (because downloading is taxing, I'm just using a dummy script to generate empty files). For every file in the list, two files are downloaded in the form of {file_1.fastq} and {file_2.fastq}. When those files are downloaded, I rename them using the second rule; here I take advantage of being able to run python code in a rule using the run keyword. When I do a dry run using the -n flag, everything works. But when I do an actual run, I get an error of the form
Job Missing files after 5 seconds [list of files]
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Job id: 0 completed successfully, but some output files are missing. 0
Exiting because a job execution failed. Look above for error message
Removing output files of failed job rename_srafiles_to_samples since they might be corrupted: [list of all files]
What happens is that a directory to store my files is created, and then my files are "downloaded", and then are renamed. Then when it reaches the last file, I get this error and everything is deleted. The snakemake file is below:
import csv
import os

SRA_MAPPING = read_dictionary() #dictionary read from a file
SRAFILES = list(SRA_MAPPING.keys())[1:] #list of sra files
SAMPLES = [SRA_MAPPING[key] for key in SRAFILES] #list of sample names

rule all:
    input:
        expand("raw_samples/{samples}_1.fastq",samples=SAMPLES),
        expand("raw_samples/{samples}_2.fastq",samples=SAMPLES),

rule download_srafiles:
    output:
        expand("raw_samples/{srafiles}_1.fastq",srafiles=SRAFILES),
        expand("raw_samples/{srafiles}_2.fastq",srafiles=SRAFILES)
    shell:
        "bash dummy_download.sh"

rule rename_srafiles_to_samples:
    input:
        expand("raw_samples/{srafiles}_1.fastq",srafiles=SRAFILES),
        expand("raw_samples/{srafiles}_2.fastq",srafiles=SRAFILES)
    output:
        expand("raw_samples/{samples}_1.fastq",samples=SAMPLES),
        expand("raw_samples/{samples}_2.fastq",samples=SAMPLES)
    run:
        os.chdir(os.getcwd()+r"/raw_samples")
        for file in os.listdir():
            old_name=file[:file.find("_")]
            sample_name=SRA_MAPPING[old_name]
            new_name=file.replace(old_name,sample_name)
            os.rename(file,new_name)
I've separately tried to run download_srafiles and it worked. I also separately tried to run rename_srafiles_to_samples and it worked. But when I run those rules in conjunction, I get the error. For completeness, the script dummy_download.sh is below:
#!/bin/bash
read -a samples <<< $(cut -d , -f 1 linker.csv | tail -n +2)
for file in "${samples[@]}"
do
    touch raw_samples/${file}_1.fastq
    touch raw_samples/${file}_2.fastq
done
(linker.csv is a file that has ID_for_download in one column and sample_name in the other.)
What am I doing wrong?
EDIT: Per user dariober, changing directories via python's os module in the rule rename_srafiles_to_samples "confused" snakemake. Snakemake's logic is sound: if I change the directory to enter raw_samples, it tries to find raw_samples inside itself and fails. To that end, I tested different versions.
Exactly as dariober explained. Important bits of code:
for file in os.listdir('raw_samples'):
    old_name= file[:file.find("_")]
    sample_name=SRA_MAPPING[old_name]
    new_name= file.replace(old_name,sample_name)
    os.rename('raw_samples/' + file, 'raw_samples/' + new_name)
It lists the files in the "raw_samples" directory and then renames them. The crucial step is to add the directory prefix (raw_samples/) to both paths in each rename.
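The same idea can be written so that the loop is driven by the mapping instead of by whatever happens to be in the directory. This is only a minimal sketch: the helper name rename_by_mapping, the suffixes argument, and the temporary directory used for the demo are assumptions made for illustration and are not part of the original workflow.

```python
import os
import tempfile


def rename_by_mapping(directory, mapping, suffixes=("_1.fastq", "_2.fastq")):
    """Rename <old><suffix> to <new><suffix> inside `directory`.

    Hypothetical helper: it never calls os.chdir, so relative paths
    used elsewhere in the process keep their meaning.
    """
    renamed = []
    for old_name, new_name in mapping.items():
        for suffix in suffixes:
            src = os.path.join(directory, old_name + suffix)
            dst = os.path.join(directory, new_name + suffix)
            os.rename(src, dst)
            renamed.append(dst)
    return renamed


# Demo in a throwaway directory with the toy mapping from the answer.
demo_mapping = {"SRR1": "A", "SRR2": "B"}
with tempfile.TemporaryDirectory() as tmp:
    raw_dir = os.path.join(tmp, "raw_samples")
    os.mkdir(raw_dir)
    for sra_id in demo_mapping:
        for suffix in ("_1.fastq", "_2.fastq"):
            # Create empty placeholder files, like the dummy download script.
            open(os.path.join(raw_dir, sra_id + suffix), "w").close()
    rename_by_mapping(raw_dir, demo_mapping)
    demo_listing = sorted(os.listdir(raw_dir))

print(demo_listing)  # ['A_1.fastq', 'A_2.fastq', 'B_1.fastq', 'B_2.fastq']
```

Iterating over the mapping rather than os.listdir also means that unrelated or already-renamed files in the directory are simply ignored instead of raising a KeyError.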
The same as my original post, but instead of staying in the raw_samples directory, I change back to the original working directory once the loop has finished. It works.
os.chdir(os.getcwd()+r"/raw_samples")
for file in os.listdir():
    old_name= file[:file.find("_")]
    sample_name=SRA_MAPPING[old_name]
    new_name= file.replace(old_name,sample_name)
    os.rename(file,new_name)
os.chdir("..")
Same as my original post, but instead of modifying anything in the run segment, I modify the output to exclude the file directory. This means that I have to modify my rule all too. It didn't work. Code is below:
rule all:
    input:
        expand("{samples}_1.fastq",samples=SAMPLES),
        expand("{samples}_2.fastq",samples=SAMPLES),

rule download_srafiles:
    output:
        expand("raw_samples/{srafiles}_1.fastq",srafiles=SRAFILES),
        expand("raw_samples/{srafiles}_2.fastq",srafiles=SRAFILES)
    shell:
        "touch {output}"

rule rename_srafiles_to_samples:
    input:
        expand("raw_samples/{srafiles}_1.fastq",srafiles=SRAFILES),
        expand("raw_samples/{srafiles}_2.fastq",srafiles=SRAFILES)
    output:
        expand("{samples}_1.fastq",samples=SAMPLES),
        expand("{samples}_2.fastq",samples=SAMPLES)
    run:
        os.chdir(os.getcwd()+r"/raw_samples")
        for file in os.listdir():
            old_name= file[:file.find("_")]
            sample_name=SRA_MAPPING[old_name]
            new_name= file.replace(old_name,sample_name)
            os.rename(file,new_name)
The error it gives is:
MissingOutputException in line 24
...
Job files missing
The files are actually there, so I don't know whether I made an error in the code or whether this is a bug.
I wouldn't say that this is a problem with snakemake. It's more of a problem with my poorly thought-out process. In retrospect, it makes perfect sense that entering a directory messes up snakemake's input/output handling. If I want to use the os module in snakemake to change directories, I have to be very careful: enter wherever I need to, but ultimately go back to my original starting place. Many thanks to /u/dariober and /u/SultanOrazbayev.
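One way to make "go back to my original starting place" hard to forget is a small context manager that restores the working directory even if the body raises. This is a sketch rather than anything snakemake provides: the name working_directory is made up for illustration, and the demo uses a temporary directory instead of raw_samples. (Python 3.11 and later ship contextlib.chdir, which does the same job.)

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def working_directory(path):
    """Temporarily switch to `path`, then always switch back.

    Hypothetical helper name; the try/finally guarantees the original
    directory is restored even when the body raises an exception.
    """
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(previous)


# Demo: the working directory is the same before and after the block.
start_dir = os.getcwd()
with tempfile.TemporaryDirectory() as tmp:
    tmp_real = os.path.realpath(tmp)
    with working_directory(tmp):
        inside_dir = os.path.realpath(os.getcwd())
    after_dir = os.getcwd()

print(inside_dir == tmp_real, after_dir == start_dir)  # True True
```

Inside a run block, the rename loop would go in the body of the with statement, so nothing that runs afterwards sees the changed directory.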
I think snakemake gets confused by os.chdir. Your rule rename_srafiles_to_samples creates the correct files and the input/output naming is fine. However, since you have changed directory, snakemake cannot find the expected output. I'm not sure I'm correct in all this and, if so, whether it is a bug... This version avoids os.chdir and seems to work:
import csv
import os

SRA_MAPPING = {'SRR1': 'A', 'SRR2': 'B'}
SRAFILES = list(SRA_MAPPING.keys()) #list of sra files
SAMPLES = [SRA_MAPPING[key] for key in SRAFILES] #list of sample names

rule all:
    input:
        expand("raw_samples/{samples}_1.fastq",samples=SAMPLES),
        expand("raw_samples/{samples}_2.fastq",samples=SAMPLES),

rule download_srafiles:
    output:
        expand("raw_samples/{srafiles}_1.fastq",srafiles=SRAFILES),
        expand("raw_samples/{srafiles}_2.fastq",srafiles=SRAFILES)
    shell:
        "touch {output}"

rule rename_srafiles_to_samples:
    input:
        expand("raw_samples/{srafiles}_1.fastq",srafiles=SRAFILES),
        expand("raw_samples/{srafiles}_2.fastq",srafiles=SRAFILES)
    output:
        expand("raw_samples/{samples}_1.fastq",samples=SAMPLES),
        expand("raw_samples/{samples}_2.fastq",samples=SAMPLES)
    run:
        # os.chdir(os.getcwd()+r"/raw_samples")
        for file in os.listdir('raw_samples'):
            old_name= file[:file.find("_")]
            sample_name=SRA_MAPPING[old_name]
            new_name= file.replace(old_name,sample_name)
            os.rename('raw_samples/' + file, 'raw_samples/' + new_name)
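The path effect can be reproduced without snakemake. The sketch below only shows how a relative path behaves after os.chdir in plain Python; that snakemake's check for output files runs into exactly this is my assumption, as said above. The temporary project directory and the variable names are made up for the demo.

```python
import os
import tempfile

start_dir = os.getcwd()
with tempfile.TemporaryDirectory() as project:
    try:
        os.chdir(project)
        os.mkdir("raw_samples")
        expected_output = "raw_samples/A_1.fastq"  # relative, like a rule output
        open(expected_output, "w").close()

        # From the project root, the relative path resolves to the file.
        visible_before = os.path.exists(expected_output)

        # After entering raw_samples, the same string now points at
        # raw_samples/raw_samples/A_1.fastq, which does not exist.
        os.chdir("raw_samples")
        visible_after = os.path.exists(expected_output)
    finally:
        # Restore the original directory before the temp dir is removed.
        os.chdir(start_dir)

print(visible_before, visible_after)  # True False
```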
(However, a more snakemake-ish solution would be to have a wildcard for the SRR id and have each rule executed once for each SRR id, basically avoiding expand in download_srafiles and rename_srafiles_to_samples.)
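A rough, untested sketch of what that could look like is below. Several things in it are assumptions of mine rather than part of the question: the separate downloads/ directory (so the two rules cannot both claim the same raw_samples/ file names), the reverse lookup SAMPLE_TO_SRA, and the use of mv in a shell block instead of os.rename.

```snakemake
SRA_MAPPING = {'SRR1': 'A', 'SRR2': 'B'}  # ID_for_download -> sample_name
# Hypothetical reverse lookup: sample_name -> ID_for_download
SAMPLE_TO_SRA = {sample: sra for sra, sample in SRA_MAPPING.items()}

rule all:
    input:
        expand("raw_samples/{sample}_{read}.fastq",
               sample=list(SAMPLE_TO_SRA), read=[1, 2]),

# Runs once per SRR id; the wildcard replaces expand() in the output.
rule download_srafile:
    output:
        "downloads/{srafile}_1.fastq",
        "downloads/{srafile}_2.fastq",
    shell:
        "touch {output}"

# Runs once per sample; the input function looks up the matching SRR id.
rule rename_srafile_to_sample:
    input:
        lambda wildcards: expand("downloads/{srafile}_{read}.fastq",
                                 srafile=SAMPLE_TO_SRA[wildcards.sample],
                                 read=[1, 2]),
    output:
        "raw_samples/{sample}_1.fastq",
        "raw_samples/{sample}_2.fastq",
    shell:
        "mv {input[0]} {output[0]} && mv {input[1]} {output[1]}"
```

With per-sample rules there is no need to list or change directories at all, because every job already knows its own input and output paths.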