Often I need to process a directory of several CSV files and produce a single output file. Frequently, I rely on GNU parallel to run these tasks concurrently. However, I need a way to discard the first row (header) for all but the first job that returns output.
To make this concrete, imagine a directory of several CSV files like this...
x,y
1,1.2
2,5.3
3,6.0
Then, there's some (Python) script, call it calc.py, that cleans the data or does calculations...
import csv
import math
import sys

rdr = csv.DictReader(sys.stdin)
wtr = csv.DictWriter(sys.stdout, fieldnames=['x', 'y', 'siny'])
wtr.writeheader()
for row in rdr:
    row['siny'] = math.sin(float(row['y']))
    wtr.writerow(row)
We can then process the data files in parallel with GNU parallel...
parallel --lb python calc.py '<' {} ::: *.csv
This, however, will produce multiple header rows. For example...
x,y,siny
1,1.2,0.9320390859672263
2,5.3,-0.8322674422239013
3,6.0,-0.27941549819892586
x,y,siny
4,7.2,0.7936678638491531
5,2.2,0.8084964038195901
6,0.9,0.7833269096274833
I am looking for a simple way (ideally an option to parallel) to keep only the first header line in the output. The header-related options in the manual seem to apply to inputs, not outputs.
I see a few options, but I don't love any of them...

1. Not have calc.py output the header, and instead echo a header before running parallel. The disadvantage is that the header must be known in advance, or we need to peek at it by running something like python calc.py < data1.csv | head -n 1 before running parallel.
2. Pre-process the input files (with xsv, tail, sed, etc.), removing the header from all but the first. This has the disadvantage of having to manage additional files on disk and clean them up afterwards.
3. Post-process the combined output to drop the repeated header lines; I would prefer an option to parallel to that.
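The pre-processing idea (strip the header from all but the first file with tail) could be sketched like this; the data files and scratch directory here are fabricated for the demo:

```shell
printf 'x,y\n1,1.2\n' > data1.csv
printf 'x,y\n4,7.2\n' > data2.csv

mkdir -p trimmed
set -- data1.csv data2.csv
cp "$1" "trimmed/$1"                # the first file keeps its header
shift
for f in "$@"; do
  tail -n +2 "$f" > "trimmed/$f"    # later files lose theirs
done
```

The trimmed copies can then be fed to parallel, but they have to be cleaned up afterwards, which is exactly the drawback noted above.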
What's the best way to solve this? Is there an option that tells parallel to keep the header row from only the first job's output?
How about adjusting option 1:
Make the program take the job number as an extra argument, and only emit the header for job 1:

    if jobnumber == 1:
        output header
To guarantee that job 1 is printed first, use --keep-order:
parallel --keep-order python calc.py {#} '<' {} ::: *.csv
GNU Parallel will cache the output from the running jobs in /tmp to serialize the output, which may or may not be slower than --lb.
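A minimal sketch of the adjusted calc.py, assuming the job number arrives as sys.argv[1] (that is what {#} passes above) and the CSV still arrives on stdin:

```python
import csv
import math
import sys

def process(jobnumber, infile, outfile):
    """Add a siny column; emit the header only for job 1."""
    rdr = csv.DictReader(infile)
    wtr = csv.DictWriter(outfile, fieldnames=['x', 'y', 'siny'])
    if jobnumber == 1:
        wtr.writeheader()
    for row in rdr:
        row['siny'] = math.sin(float(row['y']))
        wtr.writerow(row)

if __name__ == '__main__':
    # {#} is passed as the first argument; the file is redirected to stdin
    process(int(sys.argv[1]), sys.stdin, sys.stdout)
```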
In general you can do something like:
parallel -k python 'calc.py < {} {= uq; $_ = seq()==1 ? "" : "| tail +2" =}' ::: *.csv
uq is available since 20190722.
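The replacement string expands to nothing for job 1 and to a trailing tail pipeline for every later job, so the effect is the same as this stand-alone sketch (file names invented for the demo; tail -n +2 is the portable spelling):

```shell
printf 'x,y\n1,1.2\n' > data1.csv
printf 'x,y\n4,7.2\n' > data2.csv

cat data1.csv > out.csv            # job 1: header kept
tail -n +2 data2.csv >> out.csv    # job 2 and later: header dropped
```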
You will be running a tail for every job after the first, so it can be slightly slower. On my system tail delivers 0.5 GB/s per core.