linux, bash, awk, sed, gnu-parallel

Print lines between line numbers from a line list and save every instance in separate file using GNU Parallel


I have a file, say "Line_File", that lists a file ID followed by a start and an end line number:

F_a 1 108
F_b 109 1210
F_c 131 1190

I have another file, "Data_File", from which I need to extract all the lines between each pair of line numbers listed in Line_File.

The command in sed:

sed -n '1,108p' Data_File > F_a.txt

does the job, but I need to do this for every pair of values in columns 2 and 3 of Line_File and save each result under the file name given in column 1 of Line_File.

If $1, $2 and $3 are the three columns of Line_File, then I am looking for a command something like

sed -n '$2,$3p' Data_File > $1.txt

I can do the same with a Bash loop, but that will be very slow for a very large file, say 40 GB.

I specifically want to do this because I am trying to use GNU Parallel to make it faster, and slicing by line number will make the output non-overlapping. I am trying to execute a command like this:

cat Data_File | parallel -j24 --pipe --block 1000M --cat LC_ALL=C sed -n '$2,$3p' > $1.txt

But I am not able to use the column assignments $1, $2 and $3 properly.

I tried the following command:

awk '{system("sed -n \""$2","$3"p\" Data_File > $1"NR)}' Line_File

But it doesn't work. Any idea where I am going wrong?

P.S. If my question is not clear, please point out what else I should share.


Solution

  • You may use xargs with the -P (parallel) option:

    xargs -P 8 -L 1 bash -c 'sed -n "$2,$3p" Data_File > $1.txt' _ < Line_File
    

    Explanation:

    • This xargs command reads Line_File as its input via <
    • The -P 8 option allows it to run up to 8 processes in parallel
    • -L 1 makes xargs process one input line at a time
    • bash -c ... starts a new bash process for each line of the input file
    • The _ before < is passed as $0, and the three columns of each input line are passed as $1, $2 and $3
    • sed -n "$2,$3p" prints only lines $2 through $3 of Data_File, and the redirection writes them to $1.txt
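
    The steps above can be tried on a small scale before running them against the real 40 GB file. The following is only a sketch: the sample Data_File and Line_File contents are made up for illustration, and the extra `$3q` is an addition that is not part of the original command. It tells sed to quit right after the last wanted line, so each process does not keep reading to the end of a large file.

```shell
# Work in a scratch directory so the demo files do not clobber real data.
cd "$(mktemp -d)"

# Hypothetical sample data: ten numbered lines and three ranges.
seq 1 10 | sed 's/^/line /' > Data_File
printf '%s\n' 'F_a 1 3' 'F_b 4 10' 'F_c 5 8' > Line_File

# Same xargs pattern as above. The added "$3q" (an assumption, not in the
# original command) stops sed once the end line of the range has been printed.
xargs -P 4 -L 1 bash -c 'sed -n "$2,$3p;$3q" Data_File > "$1.txt"' _ < Line_File

# One output file per row of Line_File.
wc -l F_a.txt F_b.txt F_c.txt
```

    Each row is handled by its own sed process, so overlapping ranges such as F_b and F_c are not a problem.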

    Or you may use GNU Parallel like this:

    parallel --colsep '[[:blank:]]' "sed -n '{2},{3}p' Data_File > {1}.txt" :::: Line_File
    

    See the examples in the official GNU Parallel documentation.
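
    As for the awk attempt in the question: inside the double-quoted awk string, `$1` is literal text rather than the first field, so the shell receives `> $1` followed by the record number and the output ends up in files named 1, 2, 3 and so on. A minimal corrected sketch follows, with the fields concatenated outside the string literals. The sample files are hypothetical, and this version runs the sed commands one after another, not in parallel.

```shell
# Work in a scratch directory so the demo files do not clobber real data.
cd "$(mktemp -d)"

# Hypothetical sample data: ten numbered lines and three ranges.
seq 1 10 | sed 's/^/line /' > Data_File
printf '%s\n' 'F_a 1 3' 'F_b 4 10' 'F_c 5 8' > Line_File

# $1, $2 and $3 sit outside the quoted pieces, so awk substitutes the field
# values and builds e.g.:  sed -n "1,3p" Data_File > F_a.txt
awk '{ system("sed -n \"" $2 "," $3 "p\" Data_File > " $1 ".txt") }' Line_File

# One output file per row of Line_File.
wc -l F_a.txt F_b.txt F_c.txt
```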