regex, bash, sed, grep

How can I extract the values of HTML <th> tags that occur multiple times in a text file using a bash script?


I have a text file that contains HTML markup. I would like to extract the values in this section:

<th scope="col" class="text-center">158</th>
<th scope="col" class="text-center">139 (87.97%)</th>
<th scope="col" class="text-center">18 (11.39%)</th>
<th scope="col" class="text-center">0 (0.00%)</th>
<th scope="col" class="text-center">1 (0.63%)</th>
<th scope="col" class="text-center">0 (0.00%)</th>

The values change from time to time, but there will always be exactly 6 of these tags. I've tried doing something like this:

text="$(cat email_resp.txt | grep -n '<th scope="col" class="text-center">' | sort)"

I also tried this:

text2="$(sed -n '/<th scope="col" class="text-center">/,/<\/th>/p' email_resp.txt)"

But what I get is a "blob" of text that I'm not able to iterate over. This is the output when I use the grep command:

689:                        <th scope="col" class="text-center">158</th>
690:                        <th scope="col" class="text-center">139 (87.97%)</th>
691:                        <th scope="col" class="text-center">18 (11.39%)</th>
692:                        <th scope="col" class="text-center">0 (0.00%)</th>
693:                        <th scope="col" class="text-center">1 (0.63%)</th>
694:                        <th scope="col" class="text-center">0 (0.00%)</th>

This is the output when I use the sed command:

<th scope="col" class="text-center">158</th>
<th scope="col" class="text-center">139 (87.97%)</th>
<th scope="col" class="text-center">18 (11.39%)</th>
<th scope="col" class="text-center">0 (0.00%)</th>
<th scope="col" class="text-center">1 (0.63%)</th>
<th scope="col" class="text-center">0 (0.00%)</th>

Ideally, I would like to extract the values between the <th> tags into an array or variables so that I can use them elsewhere.


Solution

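    #!/bin/bash

    # "file" stands for the input file (email_resp.txt in the question).
    # With <th ...> and </th> as field separators, a matching line splits
    # into three fields and $2 is the cell value. \047 is a single quote.

    # Variant 1: awk prints a complete "declare" statement, which is sourced.
    source <(
            awk -F'<th scope="col" class="text-center">|</th>' '
                    BEGIN{print "declare -a myArr1=(" }
                    NF==3{print "\047"$2"\047"}
                    END{print ")"}
            ' file
    )

    # Variant 2: awk prints only the quoted values; declare builds the array.
    declare -a myArr2="(
            $(
                    awk -F'<th scope="col" class="text-center">|</th>' '
                         NF==3{print "\047"$2"\047"}
                    ' file
            )
    )"

    # Show both arrays.
    declare -p myArr1
    declare -p myArr2

Output:
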
    declare -a myArr1=([0]="158" [1]="139 (87.97%)" [2]="18 (11.39%)" [3]="0 (0.00%)" [4]="1 (0.63%)" [5]="0 (0.00%)")
    declare -a myArr2=([0]="158" [1]="139 (87.97%)" [2]="18 (11.39%)" [3]="0 (0.00%)" [4]="1 (0.63%)" [5]="0 (0.00%)")
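
As a possible alternative that avoids sourcing generated code, the values can be read straight into an array with sed and mapfile. The sketch below is a minimal example under a few assumptions: bash 4 or later (for mapfile), one <th> cell per line as in the question, and illustrative names (extract_th_values, sample_file, values) that do not appear in the answer above. The temporary file only stands in for email_resp.txt.

```shell
#!/usr/bin/env bash
# Sketch: read <th> cell values into a bash array without eval/source.
# extract_th_values, sample_file and values are illustrative names.

extract_th_values() {
    # -n suppresses normal output; the trailing p prints only lines
    # where the substitution matched, leaving just the captured value.
    sed -n 's/.*<th scope="col" class="text-center">\(.*\)<\/th>.*/\1/p' "$1"
}

# Stand-in for email_resp.txt, using rows from the question.
sample_file=$(mktemp)
cat > "$sample_file" <<'EOF'
<table>
    <th scope="col" class="text-center">158</th>
    <th scope="col" class="text-center">139 (87.97%)</th>
    <th scope="col" class="text-center">0 (0.00%)</th>
</table>
EOF

# mapfile (bash 4+) stores one line per element; -t drops the newline.
mapfile -t values < <(extract_th_values "$sample_file")
rm -f "$sample_file"

# Iterate by index; quoting keeps "139 (87.97%)" as a single element.
for i in "${!values[@]}"; do
    printf '%d: %s\n' "$i" "${values[i]}"
done
```

With the real file, the call would be extract_th_values email_resp.txt. Because the extracted text is never evaluated as shell code, a cell that happens to contain a single quote would not break the array, which may matter if the input is not fully trusted.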