bash xargs

How to sort a huge number of files with an obscure program that only outputs their ordering


A colleague of mine wanted to run a FORTRAN program that takes files as arguments and outputs their ordering (best first) according to some biophysicochemical criterion. All he needed were the 10 best results.

The files are not big, but there are so many of them that he got the error "bash: /home/progs/bin/ardock: Argument list too long". So I created six-digit symlinks to the files and used those as arguments, which worked ;-)
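For illustration, the symlink trick could look roughly like the sketch below. The function name make_short_links and the directory layout are made up for this example; only the idea of six-digit link names comes from the description above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ardock): link every *.dat file found
# under SRC_DIR into LINK_DIR under a short six-digit name, so that the
# argument list handed to the ranking program takes fewer bytes.
make_short_links() {
    src_dir=$(cd "$1" && pwd) || return 1   # absolute path, so links resolve from anywhere
    link_dir=$2
    mkdir -p "$link_dir" || return 1
    find "$src_dir" -name '*.dat' | sort | {
        i=0
        while IFS= read -r f; do
            ln -s "$f" "$link_dir/$(printf '%06d' "$i")"
            i=$((i + 1))
        done
    }
}

# Possible usage (assumes ardock prints the names it was given):
#   make_short_links . /tmp/links
#   (cd /tmp/links && ardock * | head -n 10)
#   readlink /tmp/links/000042    # map a link name back to the real file
```

This only postpones the problem: shorter names shrink the argument list, but it still grows with the number of files.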

Now, if there are too many files for even that trick to work, what can you do to get the 10 best out of all of them? Do you have to rank the files in chunks and then rank the best of each chunk against one another, with something like this?

#!/bin/bash

best10() { ardock "$@" | head -n 10; }
export -f best10

find . -name '*.dat' -exec bash -c 'best10 "$@"' _ {} + |
xargs bash -c 'best10 "$@"' _ |
xargs bash -c 'best10 "$@"' _ |
xargs bash -c ... | ... | ...

The problem here is that the number of required xargs is not known in advance, so how can you make it a loop?

Note: as the program outputs a linefeed-delimited stream of file paths, I know that xargs can potentially break. Don't worry about that here; you can consider the filenames to be alphanumeric.


Solution

  • Maybe something like this (untested):

    #!/usr/bin/env bash
    
    best10() { ardock "$@" | head -n 10; }
    export -f best10
    
    readarray -t files < <(find . -name '*.dat' -exec bash -c 'best10 "$@"' _ {} +)
    while (( ${#files[@]} > 10 )); do
        readarray -t files < <(printf '%s\n' "${files[@]}" | xargs bash -c 'best10 "$@"' _)
    done
    printf '%s\n' "${files[@]}"
    

    That assumes your file names don't contain newlines, since your existing calls to head and xargs would already fail if they did. It also assumes printf is the shell builtin rather than an external command: a builtin is not started through exec, so the ARG_MAX limit does not apply to its arguments.
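A variant of the same loop that can be tried out without ardock might look like the sketch below. RANKER, KEEP, top_files and the optional batch argument are names invented for this example. It assumes the ranking program orders each file the same way no matter which other files are passed alongside it; otherwise the best of each chunk need not contain the overall best.

```shell
#!/usr/bin/env bash
# RANKER: any command that prints its file arguments best first
#         (ardock in the question); KEEP: how many results to keep.
RANKER=${RANKER:-ardock}
KEEP=${KEEP:-10}
export RANKER KEEP

best() { "$RANKER" "$@" | head -n "$KEEP"; }
export -f best

# top_files DIR [BATCH]
# BATCH is a hypothetical knob: it caps the arguments per ranker call via
# xargs -n, which makes it easy to exercise the loop on a small data set.
top_files() {
    local dir=$1 batch=${2:-} prev
    local -a files limit=()
    if [[ -n $batch ]]; then
        # A batch no larger than KEEP would never shrink the list.
        (( batch > KEEP )) || { echo "batch must exceed KEEP" >&2; return 2; }
        limit=(-n "$batch")
    fi
    readarray -t files < <(find "$dir" -name '*.dat')
    (( ${#files[@]} )) || return 0          # nothing to rank
    while :; do
        prev=${#files[@]}
        readarray -t files < <(printf '%s\n' "${files[@]}" |
            xargs "${limit[@]}" bash -c 'best "$@"' _)
        (( ${#files[@]} <= KEEP )) && break  # last pass was a single ranker call
        # Stop instead of spinning if a pass did not shrink the list.
        (( ${#files[@]} < prev )) || { echo "no progress" >&2; return 1; }
    done
    printf '%s\n' "${files[@]}"
}
```

Calling top_files . would then play the role of the script above, and top_files . 500 would force chunks of at most 500 files. Unlike the original, this version reads the full file list into an array before the first pass, which costs memory but not ARG_MAX.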