Tags: parallel-processing, grep, gnu-parallel

How to use GNU parallel with data and grep?


admin@admin:~$ grep -F -o 's' <<< "SDSDdsds" 
s
s
admin@admin:~$ echo "SDSDdsds" | grep -F -o 's'
s
s
admin@admin:~$ parallel grep -F -o 's' <<< "SDSDdsds" 
grep: SDSDdsds: No such file or directory
admin@admin:~$ parallel grep -F -o 's' ::: "SDSDdsds" 
grep: SDSDdsds: No such file or directory
admin@admin:~$ parallel grep -F -o 's' <<< ::: "SDSDdsds" 
grep: SDSDdsds: No such file or directory
grep: :::: No such file or directory
admin@admin:~$ echo "SDSDdsds" | parallel grep -F -o 's'
grep: SDSDdsds: No such file or directory
admin@admin:~$ parallel echo "SDSDdsds" | grep -F -o 's'
parallel: Warning: Input is read from the terminal. You are either an expert
parallel: Warning: (in which case: YOU ARE AWESOME!) or maybe you forgot
parallel: Warning: ::: or :::: or -a or to pipe data into parallel. If so
parallel: Warning: consider going through the tutorial: man parallel_tutorial
parallel: Warning: Press CTRL-D to exit.

How can I make this work with grep? None of the attempts above gives the expected output. I also have problems passing data from other programming languages into GNU parallel when the data isn't simple numbers or strings.


Solution

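  • GNU Parallel cannot parallelize on a single input. The attempts in the question fail for a different reason, though: by default GNU Parallel appends each input line (or each ::: argument) to the command as an argument, so grep treats SDSDdsds as a file name.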

    But let us say you have a 1 GB file that you want to run grep ATTACACAT on.

    parallel --pipepart -a my1gb.file --block -1 grep ATTACACAT
    

    This splits my1gb.file on the fly into one block per job slot (by default, one per CPU thread), runs grep ATTACACAT on each block in parallel, and prints each job's output as a unit once that grep finishes.

    If the input is in a variable/string:

    echo "$var" | parallel --pipe grep 2345672
    

    Unless $var is really big (> 1 GB), however, this is unlikely to give any speedup: GNU Parallel has an overhead of about 150 ms at startup plus 3 ms per job.
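
To tie this back to the commands in the question, here is a minimal sketch of both cases: data held in a variable and data held in a file. It assumes GNU parallel is on the PATH and is recent enough to support --block -1. The sample file and its contents are made up for illustration, and the variable names (out_pipe, out_args, sample, total) are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: assumes GNU parallel is installed and supports --block -1.

# Case 1: the data is in a variable.
var="SDSDdsds"

# --pipe sends stdin to grep's stdin instead of appending it as a file-name argument.
out_pipe=$(printf '%s\n' "$var" | parallel --pipe grep -F -o 's')

# Keeping ::: also works if each argument is piped into grep explicitly.
out_args=$(parallel "echo {} | grep -F -o 's'" ::: "$var")

# Case 2: the data is in a file (a hypothetical sample generated here).
sample=$(mktemp)
seq 1 100000 \
  | awk '{ if ($1 % 1000 == 0) print "GGATTACACATGG"; else print "ACGT" $1 }' \
  > "$sample"

# Each grep sees only its own block, so grep -c prints one count per block;
# the awk step sums the per-block counts into a total.
total=$(parallel --pipepart -a "$sample" --block -1 grep -c ATTACACAT \
  | awk '{ s += $1 } END { print s + 0 }')
rm -f "$sample"

printf '%s\n' "$out_pipe" "$out_args" "matching lines: $total"
```

Both variable-based commands should print s twice, matching the plain grep calls in the question. The file-based command shows a caveat of --pipe and --pipepart: options that summarize the whole input, such as grep -c, report one result per block, so the results have to be combined afterwards.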