python regex wildcard snakemake

Enforcing wildcard constraints in expansion


I want to collect all files matching the regex ^fs_node\d+\.xyz$, but I don't know how to write the expansion so that the glob uses the constraint. Right now,

wildcard_constraints:
    nodeidx = "\d+",

rule all:
    input:
        expand("fs_node{i}.xyz",
               i=glob_wildcards("fs_node{nodeidx}.xyz").nodeidx)

produces output that also matches files with irc, which I don't want:

    input: fs_node37_irc.xyz, fs_node41_irc.xyz, fs_node32.xyz, fs_node10.xyz, fs_node43.xyz, fs_node2.xyz, fs_node30_irc.xyz, fs_node16.xyz, fs_node45.xyz, fs_node23_irc.xyz, fs_node2_irc.xyz, fs_node44_irc.xyz, fs_node33_irc.xyz, fs_node35.xyz, fs_node1.xyz, fs_node28_irc.xyz, fs_node42.xyz, fs_node15_irc.xyz, fs_node12_irc.xyz, fs_node35_irc.xyz, fs_node42_irc.xyz, fs_node44.xyz, fs_node31.xyz, fs_node17_irc.xyz, fs_node8_irc.xyz, fs_node43_irc.xyz, fs_node15.xyz, fs_node5_irc.xyz, ...

How does one properly enforce (global) wildcard constraints in expansions? The constraint is global because it also gets used in other locations.


Solution

  • Maybe glob_wildcards is not flexible enough. I would explicitly list all files, select the ones you want to keep with a regex, extract the variable part nodeidx, and use that as the wildcard values. Not tested:

    import os
    import re
    
    # Match only names that are exactly fs_node<digits>.xyz; the group captures the digits.
    # Raw string so that \d and \. reach the regex engine unchanged.
    pattern = re.compile(r'fs_node(\d+)\.xyz')
    
    nodeidx = []
    for x in os.listdir(os.getcwd()):
        m = pattern.fullmatch(x)
        if m:
            nodeidx.append(m.group(1))
    
    # Restrict the wildcard to the indices actually found on disk
    wildcard_constraints:
        nodeidx = '|'.join([re.escape(x) for x in nodeidx])
    
    rule all:
        input:
            expand("fs_node{nodeidx}.xyz", nodeidx=nodeidx)
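
The filtering step can be checked outside Snakemake. Below is a minimal sketch in plain Python, assuming the file names look like those in the question; `select_node_indices` and `NODE_FILE` are hypothetical names used for illustration and are not part of Snakemake.

```python
import re

# Hypothetical helper, not part of Snakemake: keep only names that are
# exactly fs_node<digits>.xyz and return the captured digits.
NODE_FILE = re.compile(r"fs_node(\d+)\.xyz")

def select_node_indices(filenames):
    """Return the digit-only node indices as strings, sorted numerically."""
    indices = []
    for name in filenames:
        match = NODE_FILE.fullmatch(name)  # fullmatch rejects fs_node2_irc.xyz
        if match:
            indices.append(match.group(1))
    return sorted(indices, key=int)

# File names taken from the output shown in the question.
sample = ["fs_node37_irc.xyz", "fs_node32.xyz", "fs_node10.xyz",
          "fs_node2.xyz", "fs_node2_irc.xyz"]
print(select_node_indices(sample))  # prints ['2', '10', '32']
```

In a Snakefile, the returned list could be passed to expand() as the nodeidx values. As an alternative that I have not tested here, Snakemake patterns accept an inline constraint of the form {name,regex}, so glob_wildcards("fs_node{nodeidx,\d+}.xyz") may apply the same filter without listing the directory by hand.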