Search code examples
fileawksedfile-ioio

Is it possible to extract certain rows from a file where the first column contains an identifier that matches the names of other files?


I asked this as part of a larger question yesterday. A commenter suggested splitting it off into its own question.

I have x files. The contents of the first five files appear below. On the second row, i is a counter so that i = 0 for the first file. Next to i is time and it always increases by 0.5 when i increases by 1.

However, I have only kept every xth file. So while i = 0 for the first file that I have, i = 100 in the second file, and so on.

6 # This file is called "0.xyz" (<--the "6" is the same in all files)
i =       0, time =        0.000, k =      9000000000000
X        -7.6415350292        6.0494971539        8.1919697993
Y        -6.6418362233        5.9231018862        8.4056822626
Y        -8.0518670684        6.3158684817        9.0061271154
X        26.8252967820       20.4661074967       17.8025744066
Y        26.4477411207       20.4071029058       16.9121571912
Y        26.4399648474       21.2950722068       18.1009273227

6 # This file is called "100.xyz"
i =       100, time =        50.000, k =      2500000000000
X        -6.2423192714       -1.5704681396       -9.5648670474
Y        -5.4925100813       -1.6522059045       -8.9030589772
Y        -6.7765278574       -2.3616512405       -9.4776648590
X         4.1248924594       27.8487302083      -17.5400886312
Y         4.1238657681       26.9869907778      -17.9727402579
Y         5.0750649402       28.1292768156      -17.6848507559

6 # This file is called "200.xyz"
i =       200, time =        100.000, k =      3945000000000
X        19.0090162215       -5.9338939011        6.1931167954
Y        18.4748060757       -6.4905073540        5.6656446036
Y        19.2825591449       -6.4479943255        7.0179774953
X        11.0203415273       34.6029396705        2.7220660957
Y        11.1184002007       34.8398120338        1.8089008500
Y        10.3349649622       33.9509485292        2.5605794622

6 # This file is called "300.xyz"
i =       300, time =       150.000, k =      2341000000000
X        -7.6415350292        6.0494971539        8.1919697993
Y        -6.6418362233        5.9231018862        8.4056822626
Y        -8.0518670684        6.3158684817        9.0061271154
X        26.8252967820       20.4661074967       17.8025744066
Y        26.4477411207       20.4071029058       16.9121571912
Y        26.4399648474       21.2950722068       18.1009273227

6 # This file is called "400.xyz"
i =       400, time =        200.000, k =      2500000000000
X        -6.2423192714       -1.5704681396       -9.5648670474
Y        -5.4925100813       -1.6522059045       -8.9030589772
Y        -6.7765278574       -2.3616512405       -9.4776648590
X         4.1248924594       27.8487302083      -17.5400886312
Y         4.1238657681       26.9869907778      -17.9727402579
Y         5.0750649402       28.1292768156      -17.6848507559

What I would like to do is match these files (above) with data from another file (below). In the file below, each row matches one file above according to the i (Step). Then I would like to print the first three columns of the matching rows in the file below to a new file.

Although I have included file names, I would prefer to do the matching by using i, not the file names.

I know how to do this by simple iteration. What is mean is, I can count and print every xth row of the file below to a new file. However, I would like to use a more sophisticated approach that specifically matches i because these are very long files and there could potentially be rows missing so that I'd end up with a mismatch between the files above and the one I want here.

    #   Step       Time        Ax                   Ay                  Az                  Bx                  By                  Bz                  Cx                  Cy                  Cz                     Final 
           0       0.000       14.8032123290        0.0000000000        0.0000000000        0.0000000000       14.8032123290        0.0000000000        0.0000000000        0.0000000000       14.8032123290          3243.9033438318
           1       0.500       14.8029498502        0.0000000000        0.0000000000        0.0000000000       14.8029498502        0.0000000000        0.0000000000        0.0000000000       14.8029498502          3243.7307919244
           2       1.000       14.8026923814        0.0000000000        0.0000000000        0.0000000000       14.8026923814        0.0000000000        0.0000000000        0.0000000000       14.8026923814          3243.5615395313
           3       1.500       14.8024398604        0.0000000000        0.0000000000        0.0000000000       14.8024398604        0.0000000000        0.0000000000        0.0000000000       14.8024398604          3243.3955453870
           4       2.000       14.8021922354        0.0000000000        0.0000000000        0.0000000000       14.8021922354        0.0000000000        0.0000000000        0.0000000000       14.8021922354          3243.2327751298
...
           100       50.000       14.8032123290        0.0000000000        0.0000000000        0.0000000000       14.8032123290        0.0000000000        0.0000000000        0.0000000000       14.8032123290          3243.9033438318
...
           200       100.000       14.8029498502        0.0000000000        0.0000000000        0.0000000000       14.8029498502        0.0000000000        0.0000000000        0.0000000000       14.8029498502          3243.7307919244
...
           300       150.000       14.8026923814        0.0000000000        0.0000000000        0.0000000000       14.8026923814        0.0000000000        0.0000000000        0.0000000000       14.8026923814          3243.5615395313
           301       150.500       14.8024398604        0.0000000000        0.0000000000        0.0000000000       14.8024398604        0.0000000000        0.0000000000        0.0000000000       14.8024398604          3243.3955453870
...
           400       200.000       14.8021922354        0.0000000000        0.0000000000        0.0000000000       14.8021922354        0.0000000000        0.0000000000        0.0000000000       14.8021922354          3243.2327751298

Example of what I would like the result to be from manipulating the file above to match the set of example files at the top of the question:

   0         0.000         14.8032123290        
   100       50.000        14.8032123290
   200       100.000       14.8029498502
   300       150.000       14.8026923814
   400       200.000       14.8021922354

If anyone has any hints on how to approach this, I would be grateful.


Solution

  • You can use an awk script to do this as follows:

    awk 'FNR == 1 {
    if ($0 ~ /^i =/) {
        dataFile = 0;
        step[$3+0] = FILENAME;
    }
    else {dataFile = 1;}
    }
    
    dataFile == 1 && step[$1] {
        print $1, $2, $3;
    }' *.xyz data.txt
    

    (assumes the final file is called data.txt; change as necessary)

    FNR == 1 matches the first line of every file and will either capture the step from the xyz file, or set a flag indicating that we've reached the data file. The $3 + 0 bit will just force awk to convert the 3rd field to a number (i.e., drop the trailing comma) due to the request to perform a mathematical operation.

    dataFile == 1 && step[$i] matches lines within the data file where the step value was seen in an xyz file.

    NOTE: You must specify all the xyz files before the final data file, so that all the steps are collected before the data file is processed.


    Sorry, when I put the above solution together, I thought the # This file is called lines were not part of the file. Modified script is here:

    awk '
    FNR == 2 && FILENAME != ARGV[ARGC-1] {
        step[$3+0] = FILENAME;
    }
    
    FILENAME == ARGV[ARGC-1] && step[$1] {
        print $1, $2, $3;
    }' *.xyz data.txt
    
    

    This version uses ARGV & ARGC to determine whether the "data" file is being processed.

    • If not the data file and line number == 2, cache the "step" value
    • If the data file, and the step is in the list, print the first 3 fields

    Result:

    0 0.000 14.8032123290
    100 50.000 14.8032123290
    200 100.000 14.8029498502
    300 150.000 14.8026923814
    400 200.000 14.8021922354