I am trying to create a Snakemake script to automate 3 tasks. The first rule edits a .seg file so that it becomes the correct input for the next rule. The second rule runs a transcription factor analysis for each BAM and outputs a folder named after the sample, containing a subfolder named 'TranscriptionFactors'. The hierarchy would be:

output_folder
|__ sampleID
    |__ TranscriptionFactors
        |__ GTRD_ChIP_Only_1000sites
            |__ file 1
            |__ file 2
            |__ file ...

And finally, the last rule would receive as input output_folder + sampleID as well as the sampleID itself (./scoring_pipeline.sh <output_folder_from_prev_step> <sample name>), and would give as output a folder named AccessibilityOutput with many txt files, plus a subfolder with images:

output_folder
|__ sampleID
    |__ TranscriptionFactors
    |   |__ GTRD_ChIP_Only_1000sites
    |       |__ file 1
    |       |__ file 2
    |       |__ file ...
    |__ AccessibilityOutput
    |   |__ {}Accessibility1KSitesAdjusted.txt
    |   |__ {}Accessibility1KSites.txt
    |   |__ file ...
    |__ WaveletOut
    |   |__ image 1
    |   |__ image 2
    |   |__ image ...
    |__ log.txt
import os

configfile: "config.yaml"

SAMPLES = [d for d in os.listdir(config["input_folder"]) if os.path.isdir(os.path.join(config["input_folder"], d))]
#print(SAMPLES)

rule all:
    input:
        expand(config["output_folder"] + "/{sampleID}", sampleID=SAMPLES)

rule format_to_cn_seg_files:
    input:
        seg=config["seg_folder"] + "/{sampleID}.seg"
    output:
        filtered_seg="filtered_segs/{sampleID}_filtered.seg"
    shell:
        """
        mkdir -p filtered_segs
        head -n 10 {input.seg}  # Preview first 10 lines
        awk '{{print $2 "\\t" $3 "\\t" $4 "\\t" $NF}}' {input.seg} > {output.filtered_seg}  # keep chrom, start, end and the last column
        awk '!($NF ~ /^NA$/)' {output.filtered_seg} > {output.filtered_seg}.tmp && mv {output.filtered_seg}.tmp {output.filtered_seg}  # drop rows whose last column is NA
        sed -i '1d' {output.filtered_seg}  # remove the original header line
        sed -i '1i chr\\tstart\\tend\\tLog2-ratio' {output.filtered_seg}  # insert the expected header
        """

rule run_tf_analyses_from_bam:
    input:
        bam=config["input_folder"] + "/{sampleID}/{sampleID}.srtd.markdup.bam",
        bai=config["input_folder"] + "/{sampleID}/{sampleID}.srtd.markdup.bam.bai",
        seg="filtered_segs/{sampleID}_filtered.seg"
    output:
        directory(config["output_folder"] + "/{sampleID}")
    shell:
        """
        mkdir -p {output}
        python {config[path_to_scripts]}/run_tf_analyses_from_bam_ori.py \\
            -b {input.bam} \\
            -o {output} \\
            -norm-file {input.seg} -hg38 -a tf_gtrd_1000sites
        """

rule calculate_accessibility:
    input:
        tf_results=config["output_folder"] + "/{sampleID}"  # Directory with the results from the previous step
    output:
        directory(config["output_folder"] + "/{sampleID}")
    message: "Executing calculate_accessibility with {input}."
    shell:
        """
        ./scoring_pipeline.sh {input.tf_results} {wildcards.sampleID}
        """
The issue is that when I try a dry run with Snakemake, it only recognizes the rules format_to_cn_seg_files and run_tf_analyses_from_bam.
This is the output in the terminal:

snakemake -s snakemake_all_process.smk -n --debug-dag
Building DAG of jobs...
candidate job all
wildcards:
candidate job calculate_accessibility
wildcards: sampleID=04B003764_R
candidate job run_tf_analyses_from_bam
wildcards: sampleID=04B003764_R
candidate job format_to_cn_seg_files
wildcards: sampleID=04B003764_R
selected job format_to_cn_seg_files
wildcards: sampleID=04B003764_R
selected job run_tf_analyses_from_bam
wildcards: sampleID=04B003764_R
candidate job calculate_accessibility
wildcards: sampleID=VH_14B_019722_R
candidate job run_tf_analyses_from_bam
wildcards: sampleID=VH_14B_019722_R
candidate job format_to_cn_seg_files
wildcards: sampleID=VH_14B_019722_R
selected job format_to_cn_seg_files
wildcards: sampleID=VH_14B_019722_R
selected job run_tf_analyses_from_bam
wildcards: sampleID=VH_14B_019722_R
selected job all
wildcards:
Job stats:
job count
------------------------ -------
all 1
format_to_cn_seg_files 2
run_tf_analyses_from_bam 2
total 5
Reasons:
(check individual jobs above for details)
input files updated by another job:
all, run_tf_analyses_from_bam
output files have to be generated:
run_tf_analyses_from_bam
set of input files has changed since last execution:
format_to_cn_seg_files
Some jobs were triggered by provenance information, see 'reason' section in the rule displays above.
If you prefer that only modification time is used to determine whether a job shall be executed, use the command line option '--rerun-triggers mtime' (also see --help).
If you are sure that a change for a certain output file (say, <outfile>) won't change the result (e.g. because you just changed the formatting of a script or environment definition), you can also wipe its metadata to skip such a trigger via 'snakemake --cleanup-metadata <outfile>'.
Rules with provenance triggered jobs: format_to_cn_seg_files
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
I tried editing the rule all, but it also gives an error about the folder dependency... (I don't have the output yet.)
The first issue here is that you have two rules that produce the exact same output:

directory(config["output_folder"] + "/{sampleID}")

Normally I'd expect to see an "ambiguous rule" error, but for some reason in this particular case Snakemake is choosing the first rule that can produce the requested directory, which is run_tf_analyses_from_bam.
Also, your calculate_accessibility rule cannot work, as it has its input and its output set to the exact same pattern. Each job that runs from a rule needs to create every item in its output, regardless of whether that item is a file or a directory, so you can never have a rule whose output is the same as its input.
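One way to break that cycle (a sketch only, not tested against your pipeline, and the accessibility_done/ path is just a name I made up) is to give calculate_accessibility a per-sample sentinel file as its output, and to have rule all request those sentinels instead of the per-sample directories:

# Sketch: the sentinel files deliberately live outside the per-sample output
# directory, because declaring an output nested inside another rule's
# directory() output makes Snakemake raise a ChildIOException.
rule all:
    input:
        expand("accessibility_done/{sampleID}.done", sampleID=SAMPLES)

rule calculate_accessibility:
    input:
        tf_results=config["output_folder"] + "/{sampleID}"
    output:
        touch("accessibility_done/{sampleID}.done")  # created by Snakemake once the shell command succeeds
    message: "Executing calculate_accessibility with {input}."
    shell:
        """
        ./scoring_pipeline.sh {input.tf_results} {wildcards.sampleID}
        """

Because rule all now asks for the sentinel rather than the directory, Snakemake has to schedule calculate_accessibility, which in turn pulls in run_tf_analyses_from_bam through its tf_results input, so all three rules appear in the DAG.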
I think you are under the impression that a directory() output means that the rule writes some files to that existing directory, but that's not the case. Any output directory will be removed (if it exists) by Snakemake prior to running the job, and then the job itself must create the output directory. So in terms of building the DAG, a directory() output is no different from a regular file output.
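As a minimal illustration of that behaviour (the rule and paths here are made up):

rule demo_dir_output:
    output:
        directory("results/{sample}")  # deleted by Snakemake before the job runs, if it already exists
    shell:
        """
        mkdir -p {output}                   # the job itself must create the directory
        echo "example" > {output}/out.txt   # ...and populate it with its files
        """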