
How to use xargs to output to different file names?


Say I have a large number of files in a list, like this:

$ mkdir inputs
$ for i in $(seq 1 1 10000); do printf "$i\n" > inputs/$i; done
$ find inputs/ -type f -exec readlink -f {} \; > files.txt
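(As an aside, with a reasonably recent GNU coreutils, readlink -f accepts several paths at once, so find can batch the arguments itself via -exec ... + instead of spawning one readlink per file; this is the same batching idea that xargs applies below.)

$ find inputs/ -type f -exec readlink -f {} + > files.txt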

and I want to pass them all through a script that looks like this:

$ cat script.py
#!/usr/bin/env python3
import sys
args = sys.argv[1:]
output_file = args[0]
input_files = args[1:]
text = "got {} files".format(len(input_files))
print(text)
with open(output_file, "w") as fout:
    fout.write(text + '\n')
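As a quick sanity check, running the script directly on a handful of inputs shows its behaviour (an illustrative run; test_output.txt is just a throwaway name):

$ ./script.py test_output.txt inputs/1 inputs/2 inputs/3
got 3 files
$ cat test_output.txt
got 3 files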

I cannot pass them all at once, because the command-line invocation would be too large for the system to handle. However, xargs takes care of this automatically; as its man page explains:

The command line for command is built up until it reaches a system-defined limit (unless the -n and -L options are used). The specified command will be invoked as many times as necessary to use up the list of input items. In general, there will be many fewer invocations of command than there were items in the input. This will normally have significant performance benefits. Some commands can usefully be executed in parallel too; see the -P option.
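If you are curious what that system-defined limit works out to on your machine, GNU xargs can report it directly (a quick check that assumes GNU findutils; the numbers it prints vary per system):

$ xargs --show-limits < /dev/null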

You can see the batching in action:

$ cat files.txt | xargs ./script.py output.txt
got 2151 files
got 2152 files
got 2152 files
got 2152 files
got 1393 files

Here, xargs has broken the input up into 5 batches and run a separate command for each one.

However, the output file holds only the contents of the last invocation:

$ cat output.txt
got 1393 files
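For comparison, you can also pin the batch size explicitly with -n, which caps how many arguments xargs reads from its input per invocation (the fixed output.txt argument does not count toward that cap). Against the 10000-file setup above, this should run the script five times with 2000 files each:

$ cat files.txt | xargs -n 2000 ./script.py output.txt
got 2000 files
got 2000 files
got 2000 files
got 2000 files
got 2000 files

Of course, output.txt still only keeps the line from the last batch, so this does not solve the naming problem by itself.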

What I want instead is a set of output files that look like this:

output1.txt # got 2151 files
output2.txt # got 2152 files
output3.txt # got 2152 files
output4.txt # got 2152 files
output5.txt # got 1393 files

There is a question here that suggests handling this inside the script. However, my script.py cannot do that itself, because it has no way of knowing that it has been run n times on batched input sets. And in real life, script.py might actually be any arbitrary third-party program that I cannot modify to do something like that.

So it would be easier if I could just pass xargs some kind of argument that automatically fills in the number n of the batch being processed, such as

$ cat files.txt | xargs ./script.py output.{n}.txt

Does something like this exist? Is there some way to fill in the command arguments with an incrementing number for each of the batches that xargs has chunked the input into?


Solution

  • Here is another approach I have found, using GNU parallel instead of xargs:

    $ parallel -a files.txt --xargs ./script.py output.{#}.txt {}
    got 1041 files
    got 1041 files
    got 1041 files
    got 1041 files
    got 1041 files
    got 631 files
    got 1041 files
    got 1041 files
    got 1041 files
    got 1041 files
    
    $ ls -1 output.*
    output.10.txt
    output.1.txt
    output.2.txt
    output.3.txt
    output.4.txt
    output.5.txt
    output.6.txt
    output.7.txt
    output.8.txt
    output.9.txt
    
    $ cat output.*
    got 631 files
    got 1041 files
    got 1041 files
    got 1041 files
    got 1041 files
    got 1041 files
    got 1041 files
    got 1041 files
    got 1041 files
    got 1041 files
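
    Note that output.10.txt sorts before output.1.txt in the listing above because the names are compared as text rather than by number, so cat output.* does not concatenate the batches in job order. If the order matters, GNU ls can version-sort the names (a small follow-up sketch, assuming GNU coreutils):

    $ cat $(ls -v output.*)

    This prints the same ten lines, but in batch order 1 through 10.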