snakemake how to use glob_wilcards properly

I have many paired fastq files and I have a problem on after running trim_galore package, as it named the fastq files with _1_val_1 and _2_val_2, for example: AD50_CTGATCGTA_1_val_1.fq.gz and AD50_CTGATCGTA_2_val_2.fq.gz.

I would like continue snakemake and use

import os
import snakemake.io
import glob

DIR="AD50"
(SAMPLES,READS,) = glob_wildcards(DIR+"{sample}_{read}.fq.gz")
READS=["1","2"]

rule all:
    input:
        expand(DIR+"{sample}_dedup_{read}.fq.gz",sample=SAMPLES,read=READS)

rule clumpify:
    input:
        r1=DIR+"{sample}_1_val_1.fq.gz",
        r2=DIR+"{sample}_2_val_2.fq.gz"
    output:
        r1out=DIR+"{sample}_dedup_1.fq.gz",
        r2out=DIR+"{sample}_dedup_2.fq.gz"
    shell:
        "clumpify.sh in={input.r1} in2={input.r2} out={output.r1out} out2={output.r2out} dedupe subs=0"

and the error is:

Building DAG of jobs...
MissingInputException in line 13 of /home/peterchung/Desktop/Rerun-Result/clumpify.smk:
Missing input files for rule clumpify:
AD50/AD50_CTGATCGTA_2_val_2_val_2.fq.gz
AD50/AD50_CTGATCGTA_2_val_1_val_1.fq.gz

I tired another way, somehow the closest is that it detected the missing input like AD50_CTGATCGTA_1_val_2.fq.gz and AD50_CTGATCGTA_2_val_1.fq.gz which is not exist.

I am not sure the glob_wildcards function I used properly since there are many underscore in it. I tired:

 glob_wildcards(DIR+"{sample}_{read}_val_{read}.fq.gz")

but it did not work as well.

Solution

Glob wildcards is effectively a wrapper for applying a regular expression to the list of directories. By default, wildcards will match to .* greedily. You need to specify your wildcards more specifically and make sure your rule input matches the same pattern matching.

Going through your example:

AD50_CTGATCGTA_2_val_2.fq.gz
^ {sample}         ^ ^{read}

The {sample} wildcard consumes everything until the regex will no longer match, up to the val. {read} is then left with just 2.

In rule all, you then request {sample}_dedup_{read}.fq.gz, which is {AD50_CTGATCGTA_2_val}_dedup_{2}.fq.gz (leaving in curly braces to show where wildcards are). When that is matched to clumpify, you request as input: {AD50_CTGATCGTA_2_val}_2_val_2.fq.gz, which is why you are missing that file.

To address, you have a few options:

If sample should contain the 1_val part, then you need to update the input for clumpify to match your existing filenames (remove the extra _2_val, etc)
If sample should only contain AD50_CTGATCGTA, build a more specific filename. Consider adding wildcard constraints to limit your matches, e.g. [\d]+ for read. This seems to be what you are getting at in the last example.

Finally, the expand function by default applies the product of the supplied iterables. That's why you are getting AD50_CTGATCGTA_1_val_2.fq.gz, for example. You need to add zip as the second argument to override that default and match the order of wildcards returned from glob_wildcards. See here as well.