
Strip HTML contents using SED


I am working on a task for which SED is the designated tool. The task is to strip the contents of any web page file (*.htm or *.html), and insert the desired data into a new file.

  • Everything before and including the <body> tag is to be removed.
  • Everything from and including the </body> tag is to be removed.

Below is one example, where the <div> tags and everything between them are to be kept:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>SED Challange</title>
</head>
<body style="background-color:black;"><div style="width:100%; height:150px; margin-top:150px; text-align:center">
<img src="pic.png" width="50" height="50" alt="Pic alt text" />
</div></body></html>

However, I'm having trouble removing <body> and what comes before it:

sed 's/.*body.*>//' ./index.html > ./index.html.nobody

Instead of the desired result, only the two separate lines containing <body> and </body> are emptied:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>SED Challange</title>
</head>

<img src="pic.png" width="50" height="50" alt="Pic alt text" />

I can't see why even one would be. I appreciate any feedback.
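
A likely explanation, sketched on a hypothetical three-line sample rather than the page above: sed applies each command to one line at a time, and `.*` is greedy, so `.*body.*>` matches from the start of any line containing `body` through that line's last `>`. Lines without `body` are left alone, which would explain why the head survives while both body lines come out empty. The narrower pattern shown second is only an illustration of the difference.

```shell
# Hypothetical three-line sample; only the last two lines contain "body".
sample=$(printf '%s\n' '<head>' '<body class="x"><p>text</p>' '</body></html>')

# Original pattern: each matching line is consumed from its first character
# through its last ">", leaving an empty line behind.
blank=$(printf '%s\n' "$sample" | sed 's/.*body.*>//' | grep -c '^$')
echo "lines emptied by the greedy pattern: $blank"

# Narrower pattern: [^>]* stops at the first ">", so only the opening tag goes.
narrow=$(printf '%s\n' "$sample" | sed 's/<body[^>]*>//')
printf '%s\n' "$narrow"
```

Here the greedy pattern empties both lines that mention `body`, while the narrower one keeps `<p>text</p>` and leaves the closing tags for a separate substitution.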

Edit:

Thanks to SLePort, this is my complete script:

#!/bin/bash

# Search location, given as the first argument.
target="$1"

# Recursive, case-insensitive search for regular files ending in .htm or .html.
# The parentheses make -type f apply to both name tests; -print0 and read -d ''
# keep paths that contain spaces intact.
find "$target" -type f \( -iname '*.htm' -o -iname '*.html' \) -print0 |
while IFS= read -r -d '' h
do
    hp=$(realpath "$h") # Absolute path of the file (hit path).
    echo "Stripping performed on $hp" # Report each file found.
    nobody="${hp}_nobody" # Output file: same path with "_nobody" appended.

    # Delete everything from line 1 through the line containing </head>,
    # remove the opening and closing body tags,
    # remove the closing html tag,
    # delete blank lines,
    # and write the result to the _nobody file.
    sed '1,/<\/head>/d;s/<\/*body[^>]*>//g;s/<\/html>//;/^$/d' "$h" > "$nobody"
done
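
One detail of the find expression is worth checking in isolation. In find, the implicit "and" between tests binds tighter than `-o`/`-or`, so `-type f -iname '*.htm' -o -iname '*.html'` is read as `( -type f -a -iname '*.htm' ) -o -iname '*.html'`. The sketch below uses a throwaway directory with made-up names (`a.htm`, `archive.html`) to show the difference; it is an illustration under those assumptions, not part of the script.

```shell
# Throwaway tree: one regular file and one directory whose name ends in .html.
tmp=$(mktemp -d)
touch "$tmp/a.htm"
mkdir "$tmp/archive.html"

# Ungrouped: -type f only guards the *.htm test, so the directory is listed too.
ungrouped=$(find "$tmp" -type f -iname '*.htm' -o -iname '*.html' | sort)

# Grouped: -type f applies to both name tests.
grouped=$(find "$tmp" -type f \( -iname '*.htm' -o -iname '*.html' \) | sort)

printf 'ungrouped:\n%s\n' "$ungrouped"
printf 'grouped:\n%s\n' "$grouped"
rm -rf "$tmp"
```

The ungrouped form lists the `archive.html` directory as well as the file, and passing a directory to sed would fail.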

Solution

  • This sed command should work with the given page:

    sed '1,/<\/head>/d;s/<\/*body[^>]*>//g;s/<\/html>//' ./index.html > ./index.html.nobody
    

    It removes:

    • lines from line 1 to </head> tag
    • <body> and </body> tags
    • </html> closing tag

    But note that sed is NOT meant for parsing HTML files. Use an XML parser instead (e.g. xmllint, XMLStarlet, ...).
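
To try the command against the sample page from the question, and to compare it with a parser-based route, something like the following could be used. It is a sketch under assumptions: the scratch directory is arbitrary, and the XPath expression `//body/node()` is one guess at "everything inside the body". The `--html` and `--xpath` options belong to xmllint, and that part only runs if xmllint is installed.

```shell
# Recreate the question's sample page in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' \
  '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">' \
  '<html xmlns="http://www.w3.org/1999/xhtml">' \
  '<head>' \
  '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />' \
  '<title>SED Challange</title>' \
  '</head>' \
  '<body style="background-color:black;"><div style="width:100%; height:150px; margin-top:150px; text-align:center">' \
  '<img src="pic.png" width="50" height="50" alt="Pic alt text" />' \
  '</div></body></html>' > "$tmp/index.html"

# The sed command from this answer.
stripped=$(sed '1,/<\/head>/d;s/<\/*body[^>]*>//g;s/<\/html>//' "$tmp/index.html")
printf '%s\n' "$stripped"

# Parser-based alternative (assumed XPath), run only if xmllint is installed.
parsed=""
if command -v xmllint >/dev/null 2>&1; then
    parsed=$(xmllint --html --xpath '//body/node()' "$tmp/index.html" 2>/dev/null) || true
    printf '%s\n' "$parsed"
fi
rm -rf "$tmp"
```

On the sample page, the sed command leaves three lines: the opening <div ...> tag, the <img ... /> line, and </div>. The exact formatting of the xmllint output can vary between versions, so it is best compared by eye.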