python bioinformatics pipeline snakemake

Snakemake: Is it possible to use directories as wildcards?


Hi, I'm new to Snakemake and have a question. I want to run a tool on multiple data sets. Each data set represents one tissue, and the fastq files for each tissue are stored in that tissue's directory. The rough command for the tool is:

  python TEcount.py -rosette rosettefile -TE te_references -count result/tissue/output.csv -RNA <LIST OF FASTQ FILE FOR THE RESPECTIVE SAMPLE>          

The tissues should be the wildcards. How can I do this? Below is my first attempt, which did not work.

import os                                                                        

#collect data sets                                                               
SAMPLES=os.listdir("data/rnaseq/")                                               


rule all:                                                                        
    input:                                                                       
        expand("results/{sample}/TEtools.{sample}.output.csv", sample=SAMPLES)                   

rule run_TEtools:                                                                
    input:                                                                       
        TEcount='scripts/TEtools/TEcount.py',                                    
        rosette='data/prepared_data/rosette/rosette',                            
        te_references='data/prepared_data/references/all_TE_instances.fa'        
    params:
        #collect the fastq files in the tissue directory                                                              
        fastq_files = os.listdir("data/rnaseq/{sample}")                         
    output:                                                                      
        'results/{sample}/TEtools.{sample}.output.csv'                           
    shell:                                                                       
        'python {input.TEcount} -rosette {input.rosette} -TE {input.te_references} -count {output} -RNA {params.fastq_files}'

In the rule run_TEtools, Snakemake does not know what {sample} is.


Solution

  • A Snakemake wildcard can be anything; it is basically just a string.
    There are, however, some issues with the way you are trying to achieve what you want.

    Ok, here's how I would do it. Explanations follow:

    import os                                                                        
    
    # Collect the data sets (one directory per tissue).
    # Beware: data/rnaseq/ must contain nothing but the tissue directories holding the fastqs.
    SAMPLES=os.listdir("data/rnaseq/")
    
    def getFastqFilesForTissu(wildcards):
        fastqs = list()
        # Beware: the tissue directory must contain nothing but fastq files.
        for s in os.listdir("data/rnaseq/"+wildcards.sample):
            fastqs.append(os.path.join("data/rnaseq",wildcards.sample,s))
        return fastqs
    
    rule all:                                                                        
        input:                                                                       
            expand("results/{sample}/TEtools.{sample}.output.csv", sample=SAMPLES)                   
    
    rule run_TEtools:                                                                
        input:                                                                       
            TEcount='scripts/TEtools/TEcount.py',                                    
            rosette='data/prepared_data/rosette/rosette',                            
            te_references='data/prepared_data/references/all_TE_instances.fa',
            fastq_files = getFastqFilesForTissu        
        output:                                                                      
            'results/{sample}/TEtools.{sample}.output.csv'                           
        shell:                                                                       
            'python {input.TEcount} -rosette {input.rosette} -TE {input.te_references} -count {output} -RNA {input.fastq_files}'
    

    First of all, your fastq files should be defined as inputs so that Snakemake knows they are files and that the rule must be rerun if they change. It is bad practice to define input files as params; params are meant for parameters, usually not for files.
    Second, your script file is defined as an input. Be aware that every time you modify it, the rule will be rerun. Maybe that's what you want.
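
The rerun behaviour mentioned above can be pictured with a small timestamp check. This is only a simplified sketch of the idea, not Snakemake's actual scheduling logic, which takes more into account than modification times. The helper names `is_stale`, `_touch` and `demo` and the file names are made up for illustration.

```python
import os
import tempfile
import time


def is_stale(inputs, output):
    """Return True if `output` is missing or older than any of `inputs`.

    Hypothetical helper: a simplified picture of why declaring files as
    inputs matters. Files hidden in params would never enter this check.
    """
    if not os.path.exists(output):
        return True
    out_mtime = os.path.getmtime(output)
    return any(os.path.getmtime(path) > out_mtime for path in inputs)


def _touch(path, mtime):
    # Create an empty file and pin its modification time.
    with open(path, "w"):
        pass
    os.utime(path, (mtime, mtime))


def demo():
    """Check staleness of a throwaway output before and after its input changes."""
    with tempfile.TemporaryDirectory() as tmp:
        fastq = os.path.join(tmp, "reads.fastq")
        result = os.path.join(tmp, "output.csv")
        now = time.time()
        _touch(fastq, now - 100)
        _touch(result, now - 50)
        before_edit = is_stale([fastq], result)  # output is newer than the input
        os.utime(fastq, (now, now))              # simulate modifying the fastq
        after_edit = is_stale([fastq], result)   # input is now newer than the output
        return before_edit, after_edit


if __name__ == "__main__":
    print(demo())
```

The same reasoning applies to the script listed under input: touching it makes the output look out of date, so the rule runs again.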

    I would use a function to get the fastq files in each directory. If you want to call a function (like os.listdir()), you can't use your wildcards directly in the string: a call like os.listdir("data/rnaseq/{sample}") is executed as ordinary Python when the Snakefile is parsed, with the literal text {sample} in the path. You have to pass the wildcards to the function as a Python object. You can either define a function that takes one argument, a wildcards object containing all your wildcards, or use the lambda keyword (e.g. input = lambda wildcards: myFunction(wildcards.sample)).
    Another problem is that os.listdir() returns bare file names without the directory path, so you have to join them with the directory yourself. Also beware that the order in which os.listdir() returns your fastq files is arbitrary. Maybe that doesn't matter for your command.
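
To illustrate these last points, here is a minimal sketch of an input function that returns full paths in a fixed order. It assumes the fastq files end in `.fastq`, `.fq`, `.fastq.gz` or `.fq.gz`. The names `fastqs_for_sample` and `FASTQ_SUFFIXES` and the `root` parameter are my own, and `SimpleNamespace` merely stands in for the wildcards object that Snakemake would pass.

```python
import os
import tempfile
from types import SimpleNamespace

# Assumed file endings; adjust to whatever your data actually uses.
FASTQ_SUFFIXES = (".fastq", ".fq", ".fastq.gz", ".fq.gz")


def fastqs_for_sample(wildcards, root="data/rnaseq"):
    """Return the fastq paths of one tissue directory, sorted by file name."""
    sample_dir = os.path.join(root, wildcards.sample)
    paths = []
    # Sorting gives a reproducible order; os.listdir() alone does not.
    for name in sorted(os.listdir(sample_dir)):
        # Skip anything that does not look like a fastq file.
        if name.endswith(FASTQ_SUFFIXES):
            # os.listdir() yields bare names, so prepend the directory.
            paths.append(os.path.join(sample_dir, name))
    return paths


def demo():
    """Create a fake tissue directory and list it through the input function."""
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "liver"))
        for name in ("b.fastq", "a.fastq.gz", "notes.txt"):
            with open(os.path.join(root, "liver", name), "w"):
                pass
        wildcards = SimpleNamespace(sample="liver")  # stand-in, not Snakemake's class
        paths = fastqs_for_sample(wildcards, root=root)
        return [os.path.relpath(p, root) for p in paths]


if __name__ == "__main__":
    print(demo())
```

In a Snakefile you would hand over the function itself rather than its result, as in the rule above (fastq_files = fastqs_for_sample), so that it is only called once the value of {sample} is known.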