python bash zcat

Single zcat multiple extracts with IDs arrays


I have many gz archives, each a GB or more in size, that I cannot decompress for disk space reasons. Each archive has a specific identification number (for example test365.gz) and a structure like this:

         1    1    2 1
##########                 Name:     ZINC000077407198
@<TRIPOS>MOLECULE
 ZINC000077407198      none
@<TRIPOS>ATOM
      1 C1          5.7064    -2.3998   -12.0246 C.3        1  LIG1  -0.1500
@<TRIPOS>BOND
     1    1    2 1
##########                 Name:     ZINC000099999999
@<TRIPOS>MOLECULE
 ZINC000099999999      none
@<TRIPOS>ATOM
      1 C1         -2.0084    -5.2055   -12.9609 C.3        1  LIG1  -0.1500
@<TRIPOS>BOND
     1    1    2 1
##########                 Name:     ZINC000077402345
@<TRIPOS>MOLECULE
 ZINC000077402345     none
@<TRIPOS>ATOM
      1 C1          6.5657    -1.5531   -15.3414 C.3        1  LIG1  -0.1500
@<TRIPOS>BOND
     1    1    2 1
##########                 Name:     ZINC000077407198
@<TRIPOS>MOLECULE
 ZINC000077407198      none
@<TRIPOS>ATOM
      1 C1          3.6696    -1.8305   -14.6766 C.3        1  LIG1  -0.1500
@<TRIPOS>BOND
     1    1    2 1
##########                 Name:     ZINC000012345678
@<TRIPOS>MOLECULE
 ZINC000012345678      none
@<TRIPOS>ATOM
      1 C1          4.5368    -0.8182   -17.4314 C.3        1  LIG1  -0.1500
@<TRIPOS>BOND
     1    1    2 1
##########                 Name:     ZINC000077407100
@<TRIPOS>MOLECULE
 ZINC000077407100      none
@<TRIPOS>ATOM
      1 C1          1.4756    -2.2562   -14.0852 C.3        1  LIG1  -0.1500
@<TRIPOS>BOND
     1    1    2 1
##########                 Name:     ZINC000077407198
@<TRIPOS>MOLECULE
 ZINC000077407198      none
@<TRIPOS>ATOM
      1 C1          6.1712    -0.8991   -16.4096 C.3        1  LIG1  -0.1500
@<TRIPOS>BOND
     1    1    2 1
##########                 Name:     ZINC000077407198
@<TRIPOS>MOLECULE
 ZINC000077407198      none
@<TRIPOS>ATOM

The number of lines between the ########## header lines is variable.

I have a list pairing each target archive with a ZINC identifier to extract from it:

test365/    ZINC000077407198
test227/    ZINC000009100000
test365/    ZINC000077407100
... 

Currently I do:

zcat test365.gz | sed -n '/##########                 Name:     ZINC000077407100/,/##########                 Name:/p' > ZINC000077407100.out

and I get:

##########                 Name:     ZINC000077407100
@<TRIPOS>MOLECULE
 ZINC000077407100      none
@<TRIPOS>ATOM
      1 C1          1.4756    -2.2562   -14.0852 C.3        1  LIG1  -0.1500
@<TRIPOS>BOND
     1    1    2 1
##########                 Name:     ZINC000077407198

This works fine. If there are N blocks for ZINC000077407100, I extract all N blocks in one zcat pass, and I do not mind the trailing line starting with ##########.

The problem is that I need to read the archive N times for the N identifiers (ZINC numbers) I want information for. This takes a lot of time, since I have thousands to extract.

So I would like to find a way to pass an array or list of identifiers to a single zcat read and write the output to several different files, one per identifier in the array or list.

In other words, I would like to do a single read using zcat and extract data for a set of identifiers, not only one.

Thanks for your help!


Solution

  • Per the OP, the requirement is to process a large volume of data (millions of rows, multiple GB, and hundreds of items to retrieve). This is technically possible in pure modern bash, but it is unlikely to perform well. A more capable scripting engine will do much better here.

    A possible bash/awk solution is presented here. It scans each referenced archive once and extracts all the selected tags in a single pass. Note that the tag list is scanned once per archive, but its size is assumed to be reasonable.

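    #! /bin/bash -uex
    # Tag list: one "<archive>/ <ZINC id>" pair per line
    TAGS=data.txt
    
    # Unique archive names referenced in the tag list
    file_list=$(awk '{ print $1 }' < "$TAGS" | sort -u)
    
    for f in $file_list
    do
        gz_name=${f%/}.gz
        zcat "$gz_name" | awk -v F="$f" '
            # First input (tag list, DATA unset): remember the tags to retrieve from this archive
            !DATA && $1 == F { tags[$2] = 1 }
            # Second input (archive on stdin): OUT is the current output file, empty if the item is not selected.
            # "in" tests membership without adding an array entry for every unselected name.
            DATA && $1 == "##########" && $2 == "Name:" {
                OUT = ($3 in tags) ? $3 ".out" : ""
            }
            OUT { print > OUT }
        ' "$TAGS" DATA=1 -
    done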
    

    Needless to say, the five-line awk job above could equally be written in Python, Perl, JavaScript, or your favorite text-processing tool. Tested with the sample data set.
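
For readers who prefer Python, here is a minimal sketch of the same single-pass idea. It is not the code from the answer above: the function names `load_targets` and `extract`, the `out_dir` parameter, and the choice to open output files in append mode are assumptions made for illustration. It also assumes the tag list has the two-column layout shown in the question and that every block starts with a `########## Name: <id>` line.

```python
import gzip
import os
import re
from collections import defaultdict

# Header line that starts each block, e.g.
# "##########                 Name:     ZINC000077407100"
HEADER = re.compile(r"^#{10}\s+Name:\s+(\S+)")


def load_targets(tag_file):
    """Map each archive stem (e.g. 'test365') to the set of IDs wanted from it."""
    targets = defaultdict(set)
    with open(tag_file) as fh:
        for line in fh:
            fields = line.split()
            if len(fields) >= 2:  # skip blank or malformed lines
                targets[fields[0].rstrip("/")].add(fields[1])
    return targets


def extract(gz_path, wanted, out_dir="."):
    """Stream gz_path once and append every block whose ID is in `wanted`
    to <out_dir>/<ID>.out. Returns the number of blocks written per ID."""
    counts = defaultdict(int)
    out = None  # at most one output file is open at any time
    with gzip.open(gz_path, "rt") as src:
        for line in src:
            match = HEADER.match(line)
            if match:
                # A new block starts: close the previous output, if any.
                if out is not None:
                    out.close()
                    out = None
                zid = match.group(1)
                if zid in wanted:
                    # Append mode (an assumption): repeated blocks accumulate,
                    # so remove stale .out files before re-running.
                    out = open(os.path.join(out_dir, zid + ".out"), "a")
                    counts[zid] += 1
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return dict(counts)


# Example driver (assumes data.txt and the archives are in the current directory):
# for stem, wanted in load_targets("data.txt").items():
#     extract(stem + ".gz", wanted)
```

Keeping only one output file open at a time avoids running into the per-process open-file limit when thousands of identifiers are selected, at the cost of reopening a file each time an identifier recurs.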