
Download web page and remove content except for one html table


I regularly receive a large HTML report from another department, and editing it into the required format takes a fair amount of manual work.

I'd like to work a bit smarter. I can download the page via:

wget -qO- https://the_page.html

However, I only want to carve out a table that begins with:

<!-- START Daily Keystroke

It goes on and on for many lines of html and always ends:

</table>
</div>
</div>

This comes before the next block of data begins. I need everything between these two patterns as one chunk of text or one file.

I played around with sed and awk, which I am not really familiar with, but it seems that without knowing how many lines the file will contain each time, these tools are not appropriate for the task. Something that works on specific patterns seems more suitable.

That being the case, I can potentially install other utilities. Does anyone have experience with something that might work?


Solution

  • I suggest using gawk to cut out the first multi-line record, followed by sed to trim the head up to the <!-- ... line.

    gawk 'NR==1{print}' RS="</table>\n</div>\n</div>" input.html |sed '0,/<!-- START Daily Keystroke/d'
    

    Or, without an intermediate file:

    wget -qO- https://the_page.html | \
    gawk 'NR==1{print}' RS="</table>\n</div>\n</div>" | \
    sed '0,/<!-- START Daily Keystroke/d'
    

    This script was tested and works with the provided sample text.

    gawk Explanation:

    The gawk script cuts the input text at the first occurrence of:

    </table>
    </div>
    </div>
    

    These closing tags must be aligned to the left margin, with no indentation.

    NR==1{print}

    Print only gawk record number 1.

    The first record is all the text (many lines) up to the first match of the pattern in the RS variable.

    RS="</table>\n</div>\n</div>"

    A regular expression (RegExp) that matches the gawk multi-line record separator.

    If the closing tags are indented, as in the sample below, use the whitespace-tolerant RegExp that follows it:

              </table>
            </div>
          </div>
    

    RS="[[:space:]]*</table>[[:space:]]*\n[[:space:]]*</div>[[:space:]]*\n[[:space:]]*</div>"

    sed Explanation:

    Remove all lines up to and including the first occurrence of the RegExp <!-- START Daily Keystroke

    0,/<!-- START Daily Keystroke/

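    A sed line range: from line 0 up to the first line that matches <!-- START Daily Keystroke. The 0 start address is a GNU sed extension that lets the RegExp match on the very first line of input as well.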

    d

    Delete all lines in that range, so they are not printed.
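To see the two steps working together, here is a minimal, self-contained sketch that runs the same gawk and sed commands on a small made-up report. The sample HTML, the carve_table function name, and the fallback to plain awk are assumptions added for illustration; they are not part of the original answer. It also assumes GNU sed, since the 0,/RegExp/ address is a GNU extension.

```shell
#!/bin/sh
# Sketch only: the sample data and the carve_table name are hypothetical.

# Prefer gawk, as in the answer. Falling back to plain awk only works if
# that awk accepts a regular expression in RS (gawk and mawk do).
AWK=$(command -v gawk || command -v awk)

# carve_table FILE
# Step 1 (awk): keep record 1, i.e. all text before the first
#               "</table>\n</div>\n</div>" terminator.
# Step 2 (sed): delete everything up to and including the start comment.
carve_table() {
    "$AWK" 'NR==1{print}' RS="</table>\n</div>\n</div>" "$1" |
        sed '0,/<!-- START Daily Keystroke/d'   # 0,/re/ requires GNU sed
}

# Build a tiny stand-in for the downloaded report.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<html>
<body>
<p>intro</p>
<!-- START Daily Keystroke -->
<div>
<div>
<table>
<tr><td>42</td></tr>
</table>
</div>
</div>
<!-- START Other Report -->
<table>
<tr><td>7</td></tr>
</table>
</div>
</div>
</body>
</html>
EOF

result=$(carve_table "$sample")
rm -f "$sample"
printf '%s\n' "$result"
```

On this sample the script prints the four lines from the first <div> through the <tr> row. Neither the start comment nor the closing </table>, </div>, </div> lines appear in the output, because sed deletes the former and awk consumes the latter as the record separator. Note also that NR==1 stops at the first terminator anywhere in the input, so the approach relies on no earlier block ending with the same three lines before the start comment.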