Tags: bash, unix, awk, stream, large-files

Stream-filter a large number of lines, specified by line number, from stdin


I have a huge xz-compressed text file huge.txt.xz with millions of lines that is too large to keep around uncompressed (60 GB).

I would like to quickly filter/select a large number of lines (on the order of thousands) from that huge text file into a file filtered.txt. The line numbers to select could, for example, be specified in a separate text file, select.txt, in the following format:

10
14
...
1499
15858

Overall, I envisage a shell command as follows, where "TO BE DETERMINED" is the command I'm looking for:

xz -dcq huge.txt.xz | "TO BE DETERMINED" select.txt >filtered.txt

I've managed to find an awk program from a closely related question that almost does the job; the only problem is that it takes a file name instead of reading from stdin. Unfortunately, I don't really understand the awk script and don't know enough awk to alter it to work in this case.

This is what works right now, with the disadvantage of having a 60 GB file lying around rather than streaming:

xz -dcq huge.txt.xz >huge.txt
awk '!firstfile_proceed { nums[$1]; next }
     (FNR in nums)' select.txt firstfile_proceed=1 huge.txt >filtered.txt

Inspiration: https://unix.stackexchange.com/questions/612680/remove-lines-with-specific-line-number-specified-in-a-file


Solution

  • Keeping with OP's current idea:

    xz -dcq huge.txt.xz | awk '!firstfile_proceed { nums[$1]; next } (FNR in nums)' select.txt firstfile_proceed=1 -
    

    The - (at the end of the line) tells awk to read from stdin (in this case, the output from xz that is being piped into the awk call).
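    As a quick, self-contained illustration of the - convention (a throwaway example; the file name wanted.txt is hypothetical):

    printf '%s\n' 1 3 > wanted.txt     # select lines 1 and 3
    seq 5 | awk '!seen { nums[$1]; next } (FNR in nums)' wanted.txt seen=1 -
    # prints "1" and "3" from the piped-in stream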

    Another way to do this (replaces all of the above code):

    awk '
    FNR==NR { nums[$1]; next }             # process first file
    FNR in nums                            # process subsequent file(s)
    ' select.txt <(xz -dcq huge.txt.xz)
    

    Comments removed and cut down to a 'one-liner':

    awk 'FNR==NR {nums[$1];next} FNR in nums' select.txt <(xz -dcq huge.txt.xz)
    
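    Note that <(...) is process substitution, which requires a shell such as bash, ksh, or zsh (it is not POSIX sh). If it helps to sanity-check the one-liner before pointing it at the real data, here is a small self-contained run (the throwaway file names small.txt.xz and select.txt are hypothetical):

    seq 101 120 | xz > small.txt.xz    # 20 lines of test data
    printf '%s\n' 3 7 12 > select.txt  # request lines 3, 7 and 12
    awk 'FNR==NR {nums[$1];next} FNR in nums' select.txt <(xz -dcq small.txt.xz)
    # expected output: 103, 107, 112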

    Adding some logic to implement Ed Morton's comment (stop processing once FNR exceeds the largest value from select.txt):

    awk '
    # process first file
    
    FNR==NR      { nums[$1]
                   maxFNR= ($1>maxFNR ? $1 : maxFNR)
                   next
                 }
    
    # process subsequent file(s):
    
    FNR > maxFNR { exit }
    FNR in nums
    ' select.txt <(xz -dcq huge.txt.xz)
    
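    As a quick way to confirm the early exit actually fires (a sketch with hypothetical throwaway data; the pipe offers a million lines, but awk stops reading after line 5):

    printf '%s\n' 2 5 > select.txt
    seq 1 1000000 | awk '
    FNR==NR      { nums[$1]
                   maxFNR = ($1>maxFNR ? $1 : maxFNR)
                   next
                 }
    FNR > maxFNR { exit }
    FNR in nums
    ' select.txt -
    # prints "2" and "5"; the upstream seq then terminates early on SIGPIPE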

    NOTES:

    • Keeping in mind that we're talking about scanning millions of lines of input ...
    • The FNR > maxFNR test adds some CPU/processing time to the overall operation (though less than the FNR in nums lookup).
    • If the operation routinely needs to pull rows from, say, the last 25% of the file, then FNR > maxFNR likely provides little benefit (and probably slows the operation down).
    • If the operation routinely finds all desired rows in, say, the first 50% of the file, then FNR > maxFNR is probably worth the CPU/processing time to keep from scanning the entire input stream (then again, the xz operation on the entire file is likely the biggest time consumer).
    • Net result: the additional FNR > maxFNR test may speed up or slow down the overall process depending on how much of the input stream needs to be processed in a typical run; the OP would need to run some tests to see if there's a (noticeable) difference in overall runtime (a rough timing sketch follows below).
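
    One way to run that test is to time both variants against the same data and discard the output (a sketch, assuming bash; file names as in the question):

    # variant 1: scan the whole stream
    time awk 'FNR==NR {nums[$1]; next} FNR in nums' \
        select.txt <(xz -dcq huge.txt.xz) >/dev/null

    # variant 2: bail out once FNR passes the largest requested line number
    time awk 'FNR==NR {nums[$1]; maxFNR=($1>maxFNR ? $1 : maxFNR); next}
              FNR > maxFNR {exit}
              FNR in nums' \
        select.txt <(xz -dcq huge.txt.xz) >/dev/null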