Snakemake always rebuilds targets, even when up to date


I'm new to snakemake and am running into some behavior I don't understand. I have a set of fastq files in the directory reads/raw_fastq, with file names following the standard Illumina convention:

SAMPLENAME_SAMPLENUMBER_LANE_READ_001.fastq.gz

I'd like to create symbolic links in the directory reads/renamed_raw_fastq that simplify the names to follow the pattern:

SAMPLENAME_READ.fastq.gz

My aim is that as I add new fastq files to the project, snakemake will create symlinks only for the newly-added files.

My snakefile is as follows:

# Get sample names from read file names in the "raw" directory

readRootDir = 'reads/'
readRawDir = readRootDir + 'raw_fastq/'

import os

samples = list(set([x.split('_', 1)[0] for x in os.listdir(readRawDir)]))
samples.sort()

# Generate simplified names

readRenamedRawDir = readRootDir + 'renamed_raw_fastq/'

newNames = expand(readRenamedRawDir + "{sample}_{read}.fastq.gz", sample = samples, read = ["R1", "R2"])

# Create symlinks

import glob

def getRawName(wildcards):
    rawName = glob.glob(readRawDir + wildcards.sample + "_*_" + wildcards.read + "_001.fastq.gz")[0]
    return rawName

rule all:
    input: newNames 

rule rename:
    input: getRawName
    output: "reads/renamed_raw_fastq/{sample}_{read}.fastq.gz"
    shell: "ln -sf {input} {output}"

When I run snakemake, it tries to generate the symlinks as expected but:

  1. Always tries to create the target symlinks, even when they already exist and have later timestamps than the source fastq files.

  2. Throws errors like:

MissingOutputException in line 68 of /work/nick/FAW-MIPs/renameRaw.snakefile:
Missing files after 5 seconds:
reads/renamed_raw_fastq/Ben21_R2.fastq.gz
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.

It's almost like snakemake isn't seeing the output files it creates. Can anyone suggest what I might be missing here?

Thanks!


Solution

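  • I think

    ln -sf {input} {output}
    

    gives a symlink pointing to a missing file, i.e., it doesn't point to the source file. {input} is a path relative to the working directory, but a relative symlink target is resolved relative to the directory that contains the link, so the link ends up pointing at reads/renamed_raw_fastq/reads/raw_fastq/..., which doesn't exist. Snakemake then presumably looks through the dangling link, finds no file, and treats the output as missing, which would explain both the constant rebuilding and the MissingOutputException. You could fix it by e.g. using absolute paths, like: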

    def getRawName(wildcards):
        rawName = os.path.abspath(glob.glob(readRawDir + wildcards.sample + "_*_" + wildcards.read + "_001.fastq.gz")[0])
        return rawName
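
To see the difference outside of snakemake, here is a minimal sketch in plain Python. It assumes a POSIX filesystem, and the file name Ben21_S1_L001_R2_001.fastq.gz and the helper names make_link and demo are made up for illustration. It creates three links to the same source: one with the path exactly as given (what the original shell command does), one with an absolute path, and one with a path computed relative to the link's own directory using os.path.relpath.

```python
import os
import tempfile


def make_link(src, dst, mode):
    """Create dst as a symlink to src, replacing any existing link (like `ln -sf`)."""
    if mode == "as_given":
        target = src  # stored verbatim, as in `ln -sf {input} {output}`
    elif mode == "absolute":
        target = os.path.abspath(src)
    elif mode == "relative_to_link":
        # Path from the link's directory to the source file.
        target = os.path.relpath(src, os.path.dirname(dst))
    else:
        raise ValueError(mode)
    if os.path.lexists(dst):
        os.remove(dst)
    os.symlink(target, dst)


def demo():
    """Return whether each kind of link resolves to an existing file."""
    start = os.getcwd()
    with tempfile.TemporaryDirectory() as tmp:
        os.chdir(tmp)
        try:
            os.makedirs("reads/raw_fastq")
            os.makedirs("reads/renamed_raw_fastq")
            # Hypothetical raw file name, for illustration only.
            src = "reads/raw_fastq/Ben21_S1_L001_R2_001.fastq.gz"
            open(src, "w").close()

            results = []
            for mode in ("as_given", "absolute", "relative_to_link"):
                dst = "reads/renamed_raw_fastq/" + mode + "_R2.fastq.gz"
                make_link(src, dst, mode)
                # os.path.exists follows symlinks, so a dangling link gives False.
                results.append(os.path.exists(dst))
            return tuple(results)
        finally:
            os.chdir(start)


if __name__ == "__main__":
    print(demo())
```

Only the first link dangles. The absolute variant matches the fix suggested above; the os.path.relpath variant is an alternative that should keep the links valid if the whole project directory is moved, at the cost of a slightly more involved input function.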
    

    (As an aside, I would make sure that renaming fastq files this way doesn't result in a name collision, for example when the same sample is sequenced on different lanes of the same flow cell.)
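
One way to check for such collisions before creating any links is sketched below. The regular expression is an assumption about how the SAMPLENAME_SAMPLENUMBER_LANE_READ_001.fastq.gz convention looks in practice (sample names without underscores, S followed by digits, L followed by digits, R1 or R2), and find_collisions is a hypothetical helper, not part of snakemake.

```python
import re
from collections import defaultdict

# Assumed shape of SAMPLENAME_SAMPLENUMBER_LANE_READ_001.fastq.gz;
# adjust the pattern if your file names differ.
ILLUMINA = re.compile(
    r"^(?P<sample>[^_]+)_(?P<number>S\d+)_(?P<lane>L\d+)_(?P<read>R[12])_001\.fastq\.gz$"
)


def find_collisions(filenames):
    """Map (sample, read) to the raw files that would share one simplified name.

    Only pairs with more than one matching raw file are returned.
    File names that do not match the assumed pattern are ignored.
    """
    groups = defaultdict(list)
    for name in filenames:
        match = ILLUMINA.match(name)
        if match:
            groups[(match["sample"], match["read"])].append(name)
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}
```

In the snakefile this could be called as find_collisions(os.listdir(readRawDir)), raising an error if the result is non-empty. It would also flag the case where glob.glob(...)[0] in getRawName silently picks one of several matching files.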