I have many multi-GB gzip archives that I cannot decompress for disk-space reasons. Each archive has one specific identification number (for example, test365.gz) and a structure like this:
1 1 2 1
########## Name: ZINC000077407198
@<TRIPOS>MOLECULE
ZINC000077407198 none
@<TRIPOS>ATOM
1 C1 5.7064 -2.3998 -12.0246 C.3 1 LIG1 -0.1500
@<TRIPOS>BOND
1 1 2 1
########## Name: ZINC000099999999
@<TRIPOS>MOLECULE
ZINC000099999999 none
@<TRIPOS>ATOM
1 C1 -2.0084 -5.2055 -12.9609 C.3 1 LIG1 -0.1500
@<TRIPOS>BOND
1 1 2 1
########## Name: ZINC000077402345
@<TRIPOS>MOLECULE
ZINC000077402345 none
@<TRIPOS>ATOM
1 C1 6.5657 -1.5531 -15.3414 C.3 1 LIG1 -0.1500
@<TRIPOS>BOND
1 1 2 1
########## Name: ZINC000077407198
@<TRIPOS>MOLECULE
ZINC000077407198 none
@<TRIPOS>ATOM
1 C1 3.6696 -1.8305 -14.6766 C.3 1 LIG1 -0.1500
@<TRIPOS>BOND
1 1 2 1
########## Name: ZINC000012345678
@<TRIPOS>MOLECULE
ZINC000012345678 none
@<TRIPOS>ATOM
1 C1 4.5368 -0.8182 -17.4314 C.3 1 LIG1 -0.1500
@<TRIPOS>BOND
1 1 2 1
########## Name: ZINC000077407100
@<TRIPOS>MOLECULE
ZINC000077407100 none
@<TRIPOS>ATOM
1 C1 1.4756 -2.2562 -14.0852 C.3 1 LIG1 -0.1500
@<TRIPOS>BOND
1 1 2 1
########## Name: ZINC000077407198
@<TRIPOS>MOLECULE
ZINC000077407198 none
@<TRIPOS>ATOM
1 C1 6.1712 -0.8991 -16.4096 C.3 1 LIG1 -0.1500
@<TRIPOS>BOND
1 1 2 1
########## Name: ZINC000077407198
@<TRIPOS>MOLECULE
ZINC000077407198 none
@<TRIPOS>ATOM
The number of lines within each block delimited by the ########## lines is variable.
I also have a list of ZINC identifiers together with their target archive:
test365/ ZINC000077407198
test227/ ZINC000009100000
test365/ ZINC000077407100
...
Currently I do:
zcat test365.gz | sed -n '/########## Name: ZINC000077407100/,/########## Name:/p' > ZINC000077407100.out
and I get:
########## Name: ZINC000077407100
@<TRIPOS>MOLECULE
ZINC000077407100 none
@<TRIPOS>ATOM
1 C1 1.4756 -2.2562 -14.0852 C.3 1 LIG1 -0.1500
@<TRIPOS>BOND
1 1 2 1
########## Name: ZINC000077407198
This works fine. If there are N blocks for ZINC000077407100, I extract all N of them in a single zcat pass, and I do not mind the trailing line starting with ##########.
The problem is that I need to read the archive N times for the N identifiers / ZINC numbers I want information on, and this takes a lot of time since I have thousands of them to extract.
So I would like to find a way to pass an array or list of identifiers / ZINC numbers and route the zcat output to several different files according to the identifiers in that list.
In other words, I would like to do a single zcat read and extract the data for a whole set of identifiers, not just one.
Thanks for your help!
Per the OP, the requirement is to process a large volume of data (millions of rows, multiple GB of data, and hundreds of items to retrieve). This is technically possible with modern bash, but it is unlikely to perform well; a better scripting engine will do much better here.
A possible bash/awk solution is presented below. It scans each referenced file once and extracts all the selected identifiers in a single pass. Note that the tag list is re-read for each archive, but it is implied that its size is reasonable.
#!/bin/bash -uex

TAGS=data.txt

# Unique list of archives referenced in the tag file (first column).
file_list=$(awk '{ print $1 }' < "$TAGS" | sort -u)

for f in $file_list; do
    gz_name=${f%/}.gz
    zcat "$gz_name" | awk -v F="$f" '
        # First input (the tag file): remember the ids wanted for this archive.
        !DATA && $1 == F { tags[$2] = 1 }
        # Second input (the data): on each block header, point OUT at the
        # matching output file, or clear it if the id was not selected.
        DATA && $1 == "##########" && $2 == "Name:" {
            OUT = tags[$3] ? $3 ".out" : ""
        }
        # Copy every line of a selected block, header included.
        OUT { print > OUT }
    ' "$TAGS" DATA=1 -
done
Needless to say, this five-line awk job can equally be written in Python, Perl, JavaScript, or your favorite text-processing tool. Tested with the sample data set.
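For instance, here is a minimal Python sketch of the same single-pass idea. It assumes the tag file format shown in the question ("test365/ ZINC...") and archives named <prefix>.gz; the helper names load_tags and extract are made up for illustration:

```python
#!/usr/bin/env python3
# Single-pass extraction of blocks by ZINC identifier.
import gzip
from collections import defaultdict

def load_tags(tag_file):
    """Map each archive prefix (trailing '/' stripped) to its set of wanted ids."""
    wanted = defaultdict(set)
    with open(tag_file) as fh:
        for line in fh:
            parts = line.split()
            if len(parts) == 2:
                wanted[parts[0].rstrip("/")].add(parts[1])
    return wanted

def extract(archive, ids, out_dir="."):
    """Stream one gzip archive once, appending each selected block to <id>.out."""
    out = None    # handle of the block currently being copied, or None
    handles = {}  # id -> open handle, so repeated blocks for one id append
    with gzip.open(archive, "rt") as fh:
        for line in fh:
            if line.startswith("########## Name: "):
                name = line.split()[2]
                if name in ids:
                    if name not in handles:
                        handles[name] = open(f"{out_dir}/{name}.out", "w")
                    out = handles[name]
                else:
                    out = None
            if out:
                out.write(line)
    for h in handles.values():
        h.close()

if __name__ == "__main__":
    wanted = load_tags("data.txt")
    for prefix, ids in wanted.items():
        extract(prefix + ".gz", ids)
```

Like the awk version, each selected block is written including its leading ########## header but not the header of the following block, and N occurrences of the same id end up concatenated in one <id>.out file.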