I have never written a Makefile before, but I suspect that one would be helpful in my situation. I have a corpus of text files that I need to preprocess to extract features for machine learning. The directory structure could be like this:
/
+---Makefile
+---/corpus
| +-- a.txt
| +-- b.txt
| +-- ...
|
+---/wordcounts
| +-- a.wordcount
| +-- b.wordcount
| +-- ...
|
+---/lettercounts
| +-- a.lettercount
| +-- b.lettercount
| +-- ...
|
...
The files in /wordcounts and /lettercounts are generated from the files in /corpus. For just the file a.txt, I can write make dependencies like this:
all: wordcounts/a.wordcount lettercounts/a.lettercount
wordcounts/a.wordcount: corpus/a.txt
	cat corpus/a.txt | wc -w > wordcounts/a.wordcount
lettercounts/a.lettercount: corpus/a.txt
	cat corpus/a.txt | wc -m > lettercounts/a.lettercount
However, with thousands of files in /corpus this Makefile will become extremely long. I want to write a Makefile that adjusts to whatever files are in /corpus. The idea is that no matter how many files I put in /corpus, the Makefile will automatically make all the other files. How can I do this? Is this what automake is for?
Background: Currently, I use a number of scripts to generate large CSV files, and running all of the scripts for the whole corpus takes a couple of hours. I need to restructure things so that a change in one file does not require reprocessing the whole corpus. I welcome any suggestions for how to set up the project more efficiently, if what I am suggesting is not ideal.
Here's one way to accomplish this:
# Every input file, and the output paths derived from it.
corpora := $(wildcard corpus/*.txt)
wordcounts := $(corpora:corpus/%.txt=wordcounts/%.wordcount)
lettercounts := $(corpora:corpus/%.txt=lettercounts/%.lettercount)

.PHONY: all
all: $(wordcounts) $(lettercounts)

# Target-specific flags, plus static pattern rules mapping each output to its input.
$(wordcounts): wcflags += -w
$(wordcounts): wordcounts/%.wordcount: corpus/%.txt
$(lettercounts): wcflags += -m
$(lettercounts): lettercounts/%.lettercount: corpus/%.txt

# One shared recipe: $< is the input file, $@ is the output file.
$(wordcounts) $(lettercounts):
	cat $< | wc $(wcflags) > $@
Run make with the -r flag (--no-builtin-rules) to disable the built-in implicit rules for maximum performance.
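The rules above write into wordcounts/ and lettercounts/ but do not create those directories, so they have to exist before make runs. As a rough sketch of a variant that handles this, the same outputs can be produced with ordinary pattern rules plus order-only prerequisites. This assumes GNU make; replacing `cat` with input redirection is my own substitution, not part of the answer above, and recipe lines must start with a tab character.

```makefile
# Sketch only: same inputs and outputs as above, using plain pattern rules.
corpora      := $(wildcard corpus/*.txt)
wordcounts   := $(corpora:corpus/%.txt=wordcounts/%.wordcount)
lettercounts := $(corpora:corpus/%.txt=lettercounts/%.lettercount)

.PHONY: all
all: $(wordcounts) $(lettercounts)

# Prerequisites after "|" are order-only: the directory must exist,
# but a change in its timestamp never forces a rebuild.
wordcounts/%.wordcount: corpus/%.txt | wordcounts
	wc -w < $< > $@

lettercounts/%.lettercount: corpus/%.txt | lettercounts
	wc -m < $< > $@

# Create the output directories on demand.
wordcounts lettercounts:
	mkdir -p $@
```

With either version, editing or touching corpus/a.txt and running make again should rebuild only a.wordcount and a.lettercount, because make compares each output's timestamp with that of its own input. Adding -j (for example `make -r -j4`) lets make process several files in parallel.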