How to remove repeated HTML elements except first one?


I have an HTML file with some text repeated throughout the document. The repeated strings have font size 4 or 5, and my goal is to delete all of those repeated strings except the first appearance.

For example:

India! appears 9 times with size=5 and 2 times with size=4. I'd like to remove every appearance of India! with size=5 except the first.

India!

I've tried the sed command below in bash (I'm open to suggestions using other tools), but it doesn't work because it removes everything after the first match:

sed 's/<font size=\"[4-5]\".*<\/font>//g'

and the only output I get is this:

<!DOCTYPE html> <html> <body> 
<h1>Some header</h1> 
<p>  </p> 
<p> This is other text. </p> 
</body>
</html>

My input file is this:

<!DOCTYPE html>
<html>
<body>

<h1>Some header</h1>

    <p>
    <font size="5">India!</font>
        <p>
        <font size="4">Japan!</font>
        </p>
    </p>
    <p>Some text 1</p>
            <p>
                <font size="5">India!</font>
        </p>
    <p>Some text 2</p>
    <p>
            <font size="5">India!</font>
        <p>
            <font size="4">Japan!</font>
            </p>
        </p>
    <p>Some text 3</p>
        <p>
        <font size="5">Uganda!</font>
        </p>
    <p>Some text 4</p>
    <p>
        <font size="5">India!</font>
        <p>
        <font size="4">Japan!</font>
        </p>
        </p>
    <p>Some text 5</p>
        <p>
            <font size="5">India!</font>
        </p>
    <p>Some text 6</p>
        <p>
            <font size="5">Cameroon!</font>
        </p>
    <p>Some text 7</p>
        <p>
                <font size="4">India!</font>
        </p>
    <p>Some text 8</p>
        <p>
            <font size="5">India!</font>
        </p>
    <p>Some text 9</p>
        <p>
            <font size="5">India!</font>
        </p>
    <p>Some text 10</p>
    <p>
        <font size="5">Pakistan!</font>
    </p>
    <p>Some text 11</p>
    <p>
            <font size="5">Pakistan!</font>
    </p>
    <p>Some text 12</p>
    <p>
            <font size="5">India!</font>
        </p>
    <p>Some text 13</p>
        <p>
                <font size="4">Uganda!</font>
        </p>
        <p>
        <font size="5">India!</font>
    </p>
    <p>Some text 14</p>
    <p>
        <font size="4">India!</font>
    </p>

    <p> This is other text. </p>

    </body>
    </html>

The image below shows the input (left) and the desired output (right), both as text and as an HTML preview.



Solution

  • As you requested in your comment, here is a slightly different program to remove the associated paragraph tags as well.

    To remove the <p> and </p> lines before and after each line you want removed (the duplicates), I found it conceptually easier to run through the file twice.

    On the first pass through the file, I keep track of whether or not I've seen the combination of font size and country, just as before. In addition, I record the line numbers (FNR) of the lines that need to be removed. The code "knows" it is on the first pass when NR == FNR. NR is the total number of records read so far and FNR is the record number within the current file, so they are equal only while awk is parsing the first file.

    On the second pass through the same input file, I print the current record if it is not marked as suppressed. FNR is used to index the suppress array because a given line has the same FNR on the first pass as on the second.

    Lastly, in order to tell awk to parse the file twice, we'll need to pass the input file to awk twice on the command line.
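As a minimal sketch of the two-pass idiom on its own, the snippet below uses a made-up sample file of open/label/close lines (not the input from the question). The names `tmp`, `drop` and `two_pass_out` are chosen only for this illustration. A single pass would not be enough here, because by the time a duplicate label is detected the line before it has already been printed.

```shell
# Hypothetical sample file, used only for this illustration.
tmp=$(mktemp)
printf '%s\n' open India close open India close open Japan close > "$tmp"

two_pass_out=$(awk '
# Pass 1: NR equals FNR only while the first file argument is being read.
NR == FNR {
  # A label seen before marks itself and both neighbouring lines.
  if ($0 != "open" && $0 != "close" && seen[$0]++)
    drop[FNR - 1] = drop[FNR] = drop[FNR + 1] = 1
  next    # nothing is printed during pass 1
}
# Pass 2: FNR restarts at 1, so it indexes the same lines again.
!(FNR in drop)
' "$tmp" "$tmp")

printf '%s\n' "$two_pass_out"
rm -f "$tmp"
```

The second India block (lines 4 to 6 of the sample) is dropped, while the first India block and the Japan block are printed unchanged.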

    Here's the revised code. I also illustrate how to parse your input file twice by adding the file (let's call it input.html) two times to the command line:

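    awk -F"[\"<>= ]*" '
    # Pass 1 (NR == FNR): find duplicate <font> lines and mark them,
    # together with the <p> line before and the </p> line after.
    NR == FNR {
      if ($2 == "font") {
        # $4 is the font size and $5 is the text, e.g. "5" and "India!"
        if (seen[$4, $5])
          suppress[FNR - 1] = suppress[FNR] = suppress[FNR + 1] = 1
        seen[$4, $5] = 1
      }
      next
    }
    # Pass 2: print every line that was not marked in pass 1.
    !suppress[FNR]
    ' input.html input.html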
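To see which fields the script compares, here is a small check of how one `<font>` line is split. It is a sketch with one deliberate change: the separator uses `+` instead of `*`, so that it can never match an empty string, and I would expect both forms to give the same fields on this input. The helper name `show_fields` is made up for the example.

```shell
# Print the tag name, font size and text of each input line.
# Assumption: "+" is used here instead of the "*" in the script above.
show_fields() {
  awk -F'["<>= ]+' '{ print $2 "|" $4 "|" $5 }'
}

# The line starts with a separator, so $1 is empty and the tag name is $2.
fields=$(printf '%s\n' '        <font size="5">India!</font>' | show_fields)
printf '%s\n' "$fields"
```

Because a space is one of the separator characters, text containing spaces would be split across several fields, so the key `$4,$5` relies on single-word labels like those in the sample input.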