
Makefile for a LARGE number of files


I have never written a Makefile before, but I suspect that one would be helpful in my situation. I have a corpus of text files that I need to preprocess to extract features for machine learning. The directory structure could look like this:

/
+---Makefile
+---/corpus
|   +-- a.txt
|   +-- b.txt
|   +-- ...
|
+---/wordcounts
|   +-- a.wordcount
|   +-- b.wordcount
|   +-- ...
|
+---/lettercounts
|   +-- a.lettercount
|   +-- b.lettercount
|   +-- ...
|
...

The files in /wordcounts and /lettercounts are generated from the files in /corpus. For just the file a.txt, I can write make dependencies like this:

all: wordcounts/a.wordcount lettercounts/a.lettercount

wordcounts/a.wordcount: corpus/a.txt
    cat corpus/a.txt | wc -w > wordcounts/a.wordcount

lettercounts/a.lettercount: corpus/a.txt
    cat corpus/a.txt | wc -m > lettercounts/a.lettercount

However, with thousands of files in /corpus this Makefile will become extremely long. I want to write a Makefile that adjusts to whatever files are in /corpus. The idea is that no matter how many files I put in /corpus, the Makefile will automatically make all the other files. How can I do this? Is this what automake is for?

Background: Currently, I use a number of scripts to generate large CSV files, and running all of the scripts over the whole corpus takes a couple of hours. I need to restructure so that a change to one file does not require reprocessing the whole corpus. I welcome any suggestions for how to set up the project more efficiently, if what I am suggesting is not ideal.


Solution

  • Here's one way to accomplish this, using GNU make's wildcard function, substitution references, and static pattern rules:

    corpora      := $(wildcard corpus/*.txt)
    wordcounts   := $(corpora:corpus/%.txt=wordcounts/%.wordcount)
    lettercounts := $(corpora:corpus/%.txt=lettercounts/%.lettercount)
    
    .PHONY: all
    all: $(wordcounts) $(lettercounts)
    
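    # Target-specific variables: each group of targets gets its own wc flag.
    $(wordcounts): wcflags += -w
    # Static pattern rules: map each output file to its source in corpus/.
    $(wordcounts): wordcounts/%.wordcount: corpus/%.txt
    
    $(lettercounts): wcflags += -m
    $(lettercounts): lettercounts/%.lettercount: corpus/%.txt
    
    # Shared recipe. $< is the source file and $@ is the target. In a real
    # Makefile the recipe line must begin with a tab character, not spaces.
    # The wordcounts/ and lettercounts/ directories must already exist.
    $(wordcounts) $(lettercounts):
        cat $< | wc $(wcflags) > $@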
    

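    Run make with the -r flag (--no-builtin-rules) to disable the built-in implicit rules. With thousands of targets, this spares make from checking each one against rules that can never apply.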
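
A variation on the Makefile above, offered as a sketch rather than as the answer's own method. It assumes GNU make, and it swaps the static pattern rules for ordinary pattern rules with order-only prerequisites, so that the wordcounts/ and lettercounts/ directories are created on demand instead of having to exist beforehand. The variable and directory names are taken from the question; everything else is an illustrative choice.

```make
# Sketch only: assumes GNU make. Recipe lines must start with a tab.
corpora      := $(wildcard corpus/*.txt)
wordcounts   := $(corpora:corpus/%.txt=wordcounts/%.wordcount)
lettercounts := $(corpora:corpus/%.txt=lettercounts/%.lettercount)

.PHONY: all
all: $(wordcounts) $(lettercounts)

# Everything after the | is an order-only prerequisite: the directory must
# exist before the recipe runs, but its timestamp never forces a rebuild.
wordcounts/%.wordcount: corpus/%.txt | wordcounts
	wc -w < $< > $@

lettercounts/%.lettercount: corpus/%.txt | lettercounts
	wc -m < $< > $@

# Create the output directories the first time they are needed.
wordcounts lettercounts:
	mkdir -p $@
```

Invoking this as `make -r -j4` would combine the -r advice above with parallel jobs; the job count 4 is arbitrary. Because make compares timestamps, editing a single file in corpus/ causes only that file's two outputs to be regenerated on the next run.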