python snakemake

conditional target in Snakemake


I have 3 files:

output/file_2.txt
output/file_4.txt
output/file_6.txt

Each of these files has sample names inside. For example, output/file_2.txt looks like this; the other files follow the same logic:

sample1
sample2
sample3

Now I have a function that reads through all the files and stores the sample names in a list:

def extract_ind():
    l = []
    for n in [2, 4, 6]:
        with open(f'output/file_{n}.txt') as f:
            l.extend([line.strip() for line in f])
    return l

Then I have the default rule that collects everything, and the rule that extracts samples and creates new files named after them:

rule all:
    input: 
        expand('output/file_{n}_{sample}.vcf.gz',n=[2,4,6],sample=extract_ind())

rule extract_samples:
    input:
        vcf = 'muscle/file_{n}.vcf.gz'
    output:
        vcf_dir = 'output/file_{n}_{sample}.vcf.gz'
    shell:
        'bcftools query -l {input.vcf} | xargs -I {{}} bcftools view -O z -s {{}} -o {output.vcf_dir} {input}' 

As currently designed, rule all uses expand, which builds every combination of file number and sample name (like itertools.product). For example, for file_2 it requests targets like these:

output/file_2_sample1.vcf.gz
output/file_2_sample2.vcf.gz
output/file_2_sample3.vcf.gz
output/file_2_sample4.vcf.gz -----> oops, I don't want that: `sample4` comes from `file_4`

To overcome this issue I naively wrote an input function, but I guess this is not the way to go, since input functions are meant for rule inputs and not for target outputs, right?

def use_ind(wildcards):
    l = []
    # wildcard values are strings, so compare against '2' rather than 2
    if wildcards.n == '2':
        with open('output/file_2.txt') as f:
            l.extend([line.strip() for line in f])
            return l
    elif wildcards.n == '4':
        with open('output/file_4.txt') as f:
            l.extend([line.strip() for line in f])
            return l
    elif wildcards.n == '6':
        with open('output/file_6.txt') as f:
            l.extend([line.strip() for line in f])
            return l

Ultimately, I would like some logic that produces a result like this:

output/file_2_sample1.vcf.gz
output/file_2_sample2.vcf.gz
output/file_2_sample3.vcf.gz
output/file_4_sample4.vcf.gz
output/file_4_sample6.vcf.gz
etc....

That is, every file should only be combined with the samples that belong to it. I hope this is clear; if not, let me know and I will try to rephrase the question.


Solution

  • For the following, I will assume your output/file_{n}.txt files are inputs to the workflow; if a rule creates them, you will need to use a checkpoint for that rule.

    Don't be overwhelmed by rules and expand calls. In the end, rule all's input is just a list of the files you want, and you can build that list with any valid Python. Let's say you don't know the values of n ahead of time, but you have a directory of files that list the samples.

    def use_ind(n):
        # read the sample names listed for file number n
        with open(f'output/file_{n}.txt') as f:
            return [line.strip() for line in f]
    
    n_values = glob_wildcards('output/file_{n}.txt').n
    
    target_files = [f'output/file_{n}_{sample}.vcf.gz'
                    for n in n_values
                    for sample in use_ind(n)]
    
    rule all:
        input: target_files
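
To try the list-building logic above outside of Snakemake, here is a minimal plain-Python sketch. It is only an illustration: the sample names are made up, the lists are written to a temporary directory, and a pathlib glob plus a regex stands in for glob_wildcards. The names read_samples, build_targets and demo_lists are hypothetical helpers, not part of Snakemake.

```python
import re
import tempfile
from itertools import product
from pathlib import Path


def read_samples(path):
    """Return the non-empty, stripped lines of a sample list file."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def build_targets(list_dir):
    """Pair each file number n only with the samples listed in file_{n}.txt."""
    targets = []
    for txt in sorted(Path(list_dir).glob("file_*.txt")):
        # Stand-in for glob_wildcards: pull n out of the file name.
        n = re.fullmatch(r"file_(.+)\.txt", txt.name).group(1)
        targets.extend(
            f"output/file_{n}_{sample}.vcf.gz" for sample in read_samples(txt)
        )
    return targets


# Hypothetical sample lists, written to a temporary directory for the demo.
demo_lists = {
    "2": ["sample1", "sample2", "sample3"],
    "4": ["sample4", "sample5"],
    "6": ["sample6"],
}

with tempfile.TemporaryDirectory() as tmp:
    for n, samples in demo_lists.items():
        Path(tmp, f"file_{n}.txt").write_text("\n".join(samples) + "\n")
    paired_targets = build_targets(tmp)

# For contrast: pairing every n with every sample, as a cross product would.
all_samples = [s for samples in demo_lists.values() for s in samples]
crossed_targets = [
    f"output/file_{n}_{s}.vcf.gz" for n, s in product(demo_lists, all_samples)
]

print(len(paired_targets), len(crossed_targets))
```

With these made-up lists the paired version yields 6 targets, while the cross product yields 18, including unwanted ones such as output/file_2_sample4.vcf.gz.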
    

    You should also try to write the outputs of different rules to different directories, so that wildcard matching is more likely to work the way you expect.
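
As a rough illustration of why the layout matters, the sketch below mimics wildcard matching with plain regular expressions. It assumes each wildcard behaves like the regex `.+` (Snakemake's documented default when no constraint is given) and that matching is greedy; Snakemake's real resolution may differ in detail. The helpers wildcard_regex and parse, the split/ directory, and the sample name sample_1 are all hypothetical.

```python
import re


def wildcard_regex(pattern, constraints=None):
    """Turn a '{name}' pattern into a regex; each wildcard defaults to '.+'."""
    constraints = constraints or {}
    # re.split with a capturing group alternates literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2:  # odd positions hold wildcard names
            regex += f"(?P<{part}>{constraints.get(part, '.+')})"
        else:  # even positions hold literal text
            regex += re.escape(part)
    return re.compile(regex)


def parse(pattern, path, constraints=None):
    """Return the wildcard values for path, or None if it does not match."""
    match = wildcard_regex(pattern, constraints).fullmatch(path)
    return match.groupdict() if match else None


# A sample name containing an underscore splits in the wrong place
# when n and sample share one flat file name.
flat = parse("output/file_{n}_{sample}.vcf.gz", "output/file_2_sample_1.vcf.gz")

# Giving each file number its own directory removes the ambiguity.
nested = parse("split/file_{n}/{sample}.vcf.gz", "split/file_2/sample_1.vcf.gz")

print(flat)
print(nested)
```

In this sketch the flat pattern assigns n the value 2_sample, whereas the nested layout recovers n as 2 and sample as sample_1. Passing a constraint such as {"n": r"\d+"} to parse would be another way to get the intended split with the flat name.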