How to keep only one header row in the output when processing text using Python and GNU parallel?


The Problem

Often I need to process a directory of several CSV files and produce a single output file. Frequently, I rely on GNU parallel to run these tasks concurrently. However, I need a way to discard the first row (header) for all but the first job that returns output.

To make this concrete, imagine a directory of several CSV files like this...

x,y
1,1.2
2,5.3
3,6.0

Then, there's some (Python) script, call it calc.py, that cleans the data or does calculations...

import csv
import math
import sys

rdr = csv.DictReader(sys.stdin)
wtr = csv.DictWriter(sys.stdout, fieldnames=['x', 'y', 'siny'])
wtr.writeheader()

for row in rdr:
    row['siny'] = math.sin(float(row['y']))
    wtr.writerow(row)

We can then process the data files in parallel with GNU parallel...

parallel --lb python calc.py '<' {} ::: *.csv

This, however, will produce multiple header rows. For example...

x,y,siny
1,1.2,0.9320390859672263
2,5.3,-0.8322674422239013
3,6.0,-0.27941549819892586
x,y,siny
4,7.2,0.7936678638491531
5,2.2,0.8084964038195901
6,0.9,0.7833269096274833

I am looking for a simple way (ideally an option to parallel) to keep only the first header line in the output. The header-related options in the manual, such as --header, seem to be about inputs.

Possible Solutions

I see a few options, but I don't love any of them...

  1. Don't have calc.py output the header and instead echo a header before running parallel. The disadvantage is that the header must be known in advance, or we need to peek at it by running something like python calc.py < data1.csv | head -n 1 before running parallel.
  2. Save the output of each job to a separate file, then concatenate them ex post (e.g. with xsv, tail, sed, etc.), removing the header from all but the first. This has the disadvantage of having to manage additional files on disk and clean them up afterwards.
  3. Write another program that does this and pipe the results of parallel to that.
    • Seems CPU intensive to compare each line of output against the first line, and we know few records will match.
    • Assumes no valid data records match the header row.
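
For what it's worth, option 3 can be quite small. The sketch below is one possible filter, not something parallel provides: the function name keep_first_header and the sample rows are made up for illustration. It treats the first line it sees as the header and drops any later line identical to it, so the cost is one string comparison per line, and it still has the drawback that a data row identical to the header would be dropped.

```python
import io
import sys


def keep_first_header(lines):
    """Yield lines, dropping every later line identical to the first one."""
    header = None
    for line in lines:
        if header is None:
            header = line      # the first line seen is assumed to be the header
            yield line
        elif line != header:   # plain string comparison, no CSV parsing
            yield line


# Demo on in-memory text standing in for the combined output of parallel
# (illustrative rows, not real results).
sample = io.StringIO(
    "x,y,siny\n"
    "1,1.2,0.93\n"
    "x,y,siny\n"
    "4,7.2,0.79\n"
)
filtered = list(keep_first_header(sample))
sys.stdout.writelines(filtered)
```

To use it in a pipeline, one would pass sys.stdin instead of the sample and pipe the output of parallel into the script.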

What's the best way to solve this? Is there an option that tells parallel to drop the header row from every job's output except the first?


Solution

  • How about adjusting option 1:

    Make the program take the job number as an argument (the input file still arrives on stdin):

    if jobnumber == 1:
      output header
    

    To guarantee that job 1 is printed first, use --keep-order:

    parallel --keep-order python calc.py {#} '<' {} ::: *.csv
    

    With --keep-order, GNU Parallel buffers the output of the running jobs in temporary files (in /tmp by default) so it can print the jobs in order. This may or may not be slower than --lb.

    In general you can do something like:

    parallel -k python 'calc.py < {} {= uq; $_ = seq()==1 ? "" : "| tail -n +2" =}' ::: *.csv
    

    uq (which leaves the replacement string unquoted, so the pipe reaches the shell) is available since version 20190722.

    This runs an extra tail for every job except the first, so it can be slightly slower. On my system tail delivers 0.5 GB/s per core.
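
As a rough sketch of the first suggestion (pass the job number and write the header only for job 1), calc.py could be restructured along these lines. The function name calc and the in-memory demo data are assumptions made for illustration; the column names come from the question.

```python
import csv
import io
import math


def calc(infile, outfile, job_number):
    """Append a siny column; write the header only when job_number is 1."""
    rdr = csv.DictReader(infile)
    wtr = csv.DictWriter(outfile, fieldnames=['x', 'y', 'siny'])
    if job_number == 1:
        wtr.writeheader()
    for row in rdr:
        row['siny'] = math.sin(float(row['y']))
        wtr.writerow(row)


# Demo with in-memory data: job 1 emits the header, job 2 does not.
data = "x,y\n4,0\n"
for job in (1, 2):
    out = io.StringIO()
    calc(io.StringIO(data), out, job)
    print(job, repr(out.getvalue()))
```

In the real script the last lines would instead be something like calc(sys.stdin, sys.stdout, int(sys.argv[1])), which matches the parallel --keep-order python calc.py {#} '<' {} ::: *.csv invocation.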