
Makefile that works on file pairs


I am working on a Makefile to maintain a workflow. The main part of the workflow requires a program that works on pairs of files. I have a directory of raw data containing the raw file pairs. The two files of a pair share a name (the prefix) followed by _1.txt or _2.txt. The first step should run a QC pipeline that requires both files of a pair and produces a new pair. The command looks like this:

run_qc.sh -in1 file_1.txt -in2 file_2.txt -out1 file_1.qc.txt -out2 file_2.qc.txt

The second step takes the qc'd files and combines them into a final result like this:

run_analysis.sh file_1.qc.txt file_2.qc.txt > file_result.txt

I want this to run for all pairs that have the same prefix (i.e., file in the above example is the prefix). The best I have come up with so far is to write the rule for only the first file of each pair, assume that the second file gets produced along with it, and add a dummy rule that does nothing, just to fill out the dependency graph. It looks like this:

RAW_DIR = data/raw
QC_DIR = data/qc
ALL_FILES = $(wildcard $(RAW_DIR)/*.txt)
QC_FILES = $(patsubst $(RAW_DIR)/%.txt, $(QC_DIR)/%.qc.txt, $(ALL_FILES))

qc : $(QC_FILES)

$(QC_DIR)/%_1.qc.txt : $(RAW_DIR)/%_1.txt
    bash run_qc.sh -in1 $< -in2 $(patsubst $(RAW_DIR)/%_1.txt, $(RAW_DIR)/%_2.txt, $<) -out1 $@ -out2 $(patsubst $(QC_DIR)/%_1.qc.txt, $(QC_DIR)/%_2.qc.txt, $@)
 
 
$(QC_DIR)/%_2.qc.txt : $(RAW_DIR)/%_2.txt
    @echo

This solution seems to work alright, but I think I am missing something that would make the workflow more maintainable without the extra useless target. I am also assuming that, once I have this step figured out, it will be a lot easier to combine the results in the final analysis step.

So, ultimately my question is, how can I make a single rule that is aware of these file pairs when I may have to run this on many file pairs?

Thanks in advance,


Solution

  • The following assumes that your make is GNU make. It also assumes that the final *_result.txt files go in the same directory as the qc files. Adapt the corresponding recipe if it is not the case.

    "This solution seems to work alright": it cannot work at all because your recipe does not translate into valid shell syntax. There are several issues with your Makefile:

    1. You apparently think that make recipes are written in "make language". This is not really the case and you cannot use any make construct you want in recipes. Recipes, after very few specific translations by make (the expansion), are handed to the "shell", which is sh by default (but this can be changed). Your recipe does not translate into valid sh syntax (RAW1 = $< should be RAW1=$<, for instance). So, you should see errors (e.g., RAW1: command not found) when the shell tries to execute your recipe.

    2. Each line of a recipe is executed by its own shell. You cannot assign a shell variable on one line and use it on another, unless you write the whole recipe as a single line of shell (using the ;, &&, ||, ... shell operators to separate the commands). You can still spread such a recipe over several lines with line continuation, that is, by adding a backslash at the end of each line except the last.

    3. make expands the recipes before passing them to the shell. So if you want a $ sign to be passed to the shell you must escape it ($$) such that, after expansion by make, it becomes $.

    4. patsubst takes 3 parameters, not two. But you don't need it. The $* automatic variable is your friend.

    5. If your make is GNU make, there are ways to indicate that a single execution of a recipe produces more than one target: grouped targets. In a pattern rule with several targets, GNU make treats the targets as grouped: the recipe is executed only once to build all of them. With GNU make 4.3 or later you can also make this explicit in non-pattern rules with the &: separator:

    foo bar &: baz
        command-that-builds-foo-and-bar
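
    As an illustration of points 2 and 3, here is a minimal sketch. The targets broken and fixed and the value sample are made up for this example, and each recipe line must start with a tab character:

    ```makefile
    # Hypothetical targets, only to illustrate points 2 and 3.
    .PHONY: broken fixed

    # Each recipe line runs in its own shell, so the second line no longer
    # sees the variable and echo prints just "_1.txt".
    broken:
    	prefix=sample
    	echo "$${prefix}_1.txt"

    # One logical line (joined by the backslash) runs in a single shell and
    # echo prints "sample_1.txt". make turns $$ into $ before the shell runs.
    fixed:
    	prefix=sample; \
    	echo "$${prefix}_1.txt"
    ```

    Running make -s broken and then make -s fixed shows the difference; the -s flag only hides make's echo of the commands themselves.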
    

    So, with GNU make, your Makefile should be something like:

    RAW_DIR   := data/raw
    QC_DIR    := data/qc
    PREFIXES  := $(patsubst $(RAW_DIR)/%_1.txt,%,$(wildcard $(RAW_DIR)/*_1.txt))
    # patsubst replaces only the first % of its replacement, so build the pairs with foreach.
    QC_FILES  := $(foreach p,$(PREFIXES),$(QC_DIR)/$(p)_1.qc.txt $(QC_DIR)/$(p)_2.qc.txt)
    TARGETS   := $(patsubst %,$(QC_DIR)/%_result.txt,$(PREFIXES))
    
    .PHONY: qc all
    all: $(TARGETS)
    qc: $(QC_FILES)
    
    $(QC_DIR)/%_1.qc.txt $(QC_DIR)/%_2.qc.txt: $(RAW_DIR)/%_1.txt $(RAW_DIR)/%_2.txt
        run_qc.sh -in1 $(RAW_DIR)/$*_1.txt -in2 $(RAW_DIR)/$*_2.txt -out1 $(QC_DIR)/$*_1.qc.txt -out2 $(QC_DIR)/$*_2.qc.txt
    
    $(QC_DIR)/%_result.txt: $(QC_DIR)/%_1.qc.txt $(QC_DIR)/%_2.qc.txt
        run_analysis.sh $^ > $@
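
    To check what the file lists expand to before running anything, a small standalone sketch can help. The prefixes sampleA and sampleB are hypothetical and hard-coded here in place of the $(wildcard ...) call. foreach is used for the pair list because patsubst substitutes only the first % in its replacement:

    ```makefile
    RAW_DIR  := data/raw
    QC_DIR   := data/qc
    # Hypothetical prefixes, hard-coded instead of being derived with $(wildcard ...).
    PREFIXES := sampleA sampleB
    QC_FILES := $(foreach p,$(PREFIXES),$(QC_DIR)/$(p)_1.qc.txt $(QC_DIR)/$(p)_2.qc.txt)
    TARGETS  := $(patsubst %,$(QC_DIR)/%_result.txt,$(PREFIXES))

    # $(info ...) prints while make parses the file, before any recipe runs.
    # prints: QC_FILES = data/qc/sampleA_1.qc.txt data/qc/sampleA_2.qc.txt data/qc/sampleB_1.qc.txt data/qc/sampleB_2.qc.txt
    $(info QC_FILES = $(QC_FILES))
    # prints: TARGETS = data/qc/sampleA_result.txt data/qc/sampleB_result.txt
    $(info TARGETS = $(TARGETS))

    # Empty default target so that make has something to "build".
    .PHONY: all
    all: ;
    ```

    With the real Makefile, make -n all is a complementary check: it prints the commands that would be executed without running them.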