xml awk sed grep pcregrep

Get a block from an XML file using data from a source file


I revamped this question since I've been reading a bit on XML.

I have a source file that contains a list of AuthNumbers: 111222 111333 111444 etc.

I need to search for the numbers in that list and find them in a corresponding XML file. In the xml file the line is formatted as such: <trpcAuthCode>111222</trpcAuthCode>

This can be achieved quite painlessly using grep; however, I require the entire block containing the transaction.

The block starts with: <trans type="network sale" recalled="false"> or <trans type="network sale" recalled="false" rollback="true"> and/or some other variations. Actually <trans*> would be best if something like that is possible.

The block ends with </trans>

It doesn't need to be elegant or efficient. I just need it to work. I suspect some transactions are dropping out and I need a quick way to vet the ones that are not being processed.

If it helps here is a link to the original (sterilized) xml https://www.dropbox.com/s/cftn23tnz8uc9t8/main.xml?dl=0

And what I would like to extract: https://www.dropbox.com/s/b2bl053nom4brkk/transaction_results.xml?dl=0

The size of each result will vary, as each transaction can vary greatly in length depending on the number of products purchased. In the results xml you can see that I extracted the xml I need based on the trpcAuthCode list 111222,111333,111444.


Solution

  • Concerning XML and awk questions, you often find comments from the gurus (the ones with a k in their reputation) that XML processing in awk is complicated or insufficient. As I understand the question, the script is needed for personal and/or debugging purposes. For this, my solution should be sufficient but, please, keep in mind that it will not work on every legal XML file.

    Based on your description, the sketch of the script is:

    1. If <trans*> is matched start recording.

    2. If <trpcAuthCode> is found get its contents and compare with the list. In case of match, remember block for output.

    3. If </trans> is matched stop recording. If output has been enabled print recorded block otherwise discard it.

    Because I did something similar in SO: Shell scripting - split xml into multiple files, this should not be too hard to implement.

    Though, one additional feature is necessary: feeding the AuthNumbers array into the script. Due to a surprising coincidence, I learnt the answer just this morning in SO: How to access an array in an awk, which is declared in a different awk in shell? (thanks to the comment of jas).
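Passing the list with -v is one option. A different technique, not the one used in the script below, is to give awk the code file as its first input and fill the lookup array while NR == FNR holds (which is the case only while the first file is being read). The following is a minimal sketch, assuming one auth code per line; the function name lookup_codes and both file arguments are hypothetical.

```shell
# Sketch: load auth codes from a file into an awk array (NR == FNR idiom).
# lookup_codes CODES_FILE CANDIDATES_FILE
# prints every line of CANDIDATES_FILE that is listed in CODES_FILE.
lookup_codes() {
  awk '
    # first file: each non-empty line becomes a key of the array codes
    NR == FNR { if ($0 != "") codes[$0]; next }
    # second file: print the line if it is a key of codes
    $0 in codes
  ' "$1" "$2"
}
# Caveat: if CODES_FILE is empty, NR == FNR also holds for the second file
# and nothing is printed.
```

Because the values are stored directly as keys, no intermediate split() array is needed with this approach.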

    So, putting it all together in a script filter-xml-trpcAuthCode.awk:

    BEGIN {
      record = 0 # state for recording
      buffer = "" # buffer for recording
      found = 0 # state for found auth code
      # build temp. array from authCodes which has to be pre-defined
      split(authCodes, list, "\n")
      # build final array where values become keys
      for (i in list) authCodeList[list[i]]
      # for debugging: output of authCodeList
      print "<!-- authCodeList:"
      for (authCode in authCodeList) {
        print authCode
      }
      print "-->"
    }
    
    /<trans( [^>]*)?>/ {
      record = 1 # start recording
      buffer = "" # clear buffer
      found = 0 # reset state for found auth code
    }
    
    record {
      buffer = buffer"\n"$0 # record line (if recording is enabled)
    }
    
    record && /<trpcAuthCode>/ {
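      # extract auth code (gensub() is a GNU awk extension)
      authCode = gensub(/^.*>([^<]*)<\/trpcAuthCode.*$/, "\\1", "g")
      # check whether auth code is in authCodeList; an earlier hit is kept
      # if the block contains more than one <trpcAuthCode>
      if (authCode in authCodeList) found = 1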
    }
    
    /<\/trans>/ {
      record = 0 # stop recording
      # print buffer if auth code has been found
      if (found) {
        print buffer
      }
    }
    

    Notes:

    1. I struggled initially when applying split() to authCodes in BEGIN. This makes an array where the split values are stored with enumerated keys. Thus, I looked for a way to make the values themselves keys of the array. (Otherwise, the in operator cannot be used for the search.) I found an elegant solution in the accepted answer of SO: Check if array contains value.

    2. I implemented the proposed pattern <trans*> as /<trans( [^>]*)?>/ which will even match <trans> (although <trans> seems never to occur without attributes) but not <transSet>.

    3. The
      buffer = buffer"\n"$0
      appends the current line to the previous contents. $0 contains the line without its newline character, so the newline has to be re-inserted. The way I did it, the buffer starts with a newline and the last line ends without one. Considering that print buffer adds a newline at the end of the text, this is fine for me. Alternatively, the above snippet could be replaced by
      buffer = buffer $0 "\n"
      or even
      buffer = (buffer != "" ? buffer"\n" : "") $0.
      (It's a matter of taste.)

    4. The filtered file is simply printed to standard output channel. It might be redirected to a file. Considering this, I formatted the additional/debug output as XML comment.

    5. If you are a little bit familiar with awk, you may notice that there isn't any next statement in my script. This is intentional. In other words, the order of the rules is chosen so that a line may be processed/affected consecutively by all rules. (I tested an extreme case:
      <trans><trpcAuthCode>111222</trpcAuthCode></trans>
      and even this is processed correctly.)
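The claim in note 2 can be checked in isolation. The following sketch feeds three made-up lines to the start pattern and counts how many are treated as the beginning of a transaction block; the helper name count_trans_starts is hypothetical.

```shell
# Sketch: count input lines that match the block start pattern of the script.
count_trans_starts() {
  awk '/<trans( [^>]*)?>/ { n++ } END { print n + 0 }'
}

# <trans> and <trans ...> match, <transSet ...> does not, so this prints 2
printf '%s\n' \
  '<trans>' \
  '<trans type="network sale" recalled="false">' \
  '<transSet periodID="1">' | count_trans_starts
```

The optional group requires a space directly after <trans, which is what keeps <transSet> from being mistaken for the start of a block.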

    To simplify testing, I added a wrapper bash script filter-xml-trpcAuthCode.sh:

    #!/usr/bin/bash
    # uncomment next line for debugging
    #set -x
    # check command line arguments
    if [[ $# -ne 2 ]]; then
      echo "ERROR: Illegal number of command line arguments!"
      echo ""
      echo "Usage:"
      echo "$(basename "$0") XML_FILE AUTH_CODES"
      exit 1
    fi
    # call awk script
    awk -v authCodes="$(cat "$2")" -f filter-xml-trpcAuthCode.awk "$1"
    

    I tested the scripts (with bash in cygwin on Windows 10) against your sample file main.xml and got four matching blocks. I was a little bit concerned about the output because in your sample output transaction_results.xml there are only three matching blocks. But on checking my output visually, it seems to be appropriate. (All four hits contained a matching <trpcAuthCode> element.)

    For demonstration, I reduced your sample input a little bit to sample.xml:

    <?xml version="1.0"?>
    <transSet periodID="1" periodname="Shift" longId="2017-04-27" shortId="052" site="12345">
      <trans type="periodClose">
        <trHeader>
        </trHeader>
      </trans>
      <printCashier>
        <cashier sysid="7" empNum="07" posNum="101" period="11">A.Dude</cashier>
      </printCashier>
      <trans type="printCashier">
        <trHeader>
          <cashier sysid="7" empNum="07" posNum="101" period="11">A.Dude</cashier>
          <posNum>101</posNum>
        </trHeader>
      </trans>
      <trans type="journal">
        <trHeader>
        </trHeader>
      </trans>
      <trans type="network sale" recalled="false">
        <trHeader>
          <termMsgSN type="FINANCIAL" term="908">31054</termMsgSN>
        </trHeader>
        <trPaylines>
          <trPayline type="sale" sysid="1" locale="DOLLAR">
            <trpCardInfo>
              <trpcAccount>1234567890123456</trpcAccount>
              <trpcAuthCode>532524</trpcAuthCode>
            </trpCardInfo>
          </trPayline>
        </trPaylines>
      </trans>
      <trans type="network sale" recalled="false">
        <trHeader>
          <termMsgSN type="FINANCIAL" term="908">31054</termMsgSN>
        </trHeader>
        <trPaylines>
          <trPayline type="sale" sysid="1" locale="DOLLAR">
            <trpPaycode mop="3" cat="1" nacstendercode="generic" nacstendersubcode="generic">CREDIT</trpPaycode>
            <trpAmt>61.77</trpAmt>
            <trpCardInfo>
              <trpcAccount>2345678901234567</trpcAccount>
              <trpcAuthCode>111222</trpcAuthCode>
            </trpCardInfo>
          </trPayline>
        </trPaylines>
      </trans>
      <trans type="periodClose">
        <trHeader>
          <date>2017-04-27T23:50:17-04:00</date>
        </trHeader>
      </trans>
      <endTotals>
        <insideSales>445938.63</insideSales>
      </endTotals>
    </transSet>
    

    For the other sample input I simply copied the text into a file authCodes.txt:

    111222
    111333
    111444
    

    Using both input files in the sample session:

    $ ./filter-xml-trpcAuthCode.sh
    ERROR: Illegal number of command line arguments!
    
    Usage:
    filter-xml-trpcAuthCode.sh XML_FILE AUTH_CODES
    
    $ ./filter-xml-trpcAuthCode.sh sample.xml authCodes.txt
    <!-- authCodeList:
    111222
    111333
    111444
    -->
    
      <trans type="network sale" recalled="false">
        <trHeader>
          <termMsgSN type="FINANCIAL" term="908">31054</termMsgSN>
        </trHeader>
        <trPaylines>
          <trPayline type="sale" sysid="1" locale="DOLLAR">
            <trpPaycode mop="3" cat="1" nacstendercode="generic" nacstendersubcode="generic">CREDIT</trpPaycode>
            <trpAmt>61.77</trpAmt>
            <trpCardInfo>
              <trpcAccount>2345678901234567</trpcAccount>
              <trpcAuthCode>111222</trpcAuthCode>
            </trpCardInfo>
          </trPayline>
        </trPaylines>
      </trans>
    
    $ ./filter-xml-trpcAuthCode.sh main.xml authCodes.txt >output.txt
    
    $
    

    The last command redirects the output to a file output.txt, which may be inspected or processed afterwards.
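Since the goal is to vet transactions that may have dropped out, the redirected output can be compared against the code list. The following is a minimal sketch, assuming that every matched code appears in the output as <trpcAuthCode>CODE</trpcAuthCode> on a single line and that the list holds one code per line; the helper name report_missing is hypothetical.

```shell
# Sketch: list the auth codes for which no block ended up in the output.
# report_missing OUTPUT_FILE AUTH_CODES
report_missing() {
  # read the codes line by line (also handles a last line without newline)
  while IFS= read -r code || [ -n "$code" ]; do
    [ -n "$code" ] || continue   # skip empty lines
    # -F: fixed string, -q: quiet; print the code if its element is absent
    grep -qF "<trpcAuthCode>$code</trpcAuthCode>" "$1" || printf '%s\n' "$code"
  done < "$2"
}
```

It could be called as report_missing output.txt authCodes.txt. The debug comment at the top of the output lists the bare codes without the surrounding element, so it does not cause false hits.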