batch-file automation markup

Remove HTML Markup


I am automating a marking procedure for a Python class. However, when I download the submissions, some of them include HTML markup, which the students may have inadvertently wrapped their solutions in, such as:

<!DOCTYPE html><html><head><meta charset="UTF-8"></head><body><p><span style="font-family:'courier new', courier, monospace;">print("Bob and Bill Tiling Solutions Inc.")</span></p>
<p><span style="font-family:'courier new', courier, monospace;">h=int(input("Height   (m):"))</span></p>
<p><span style="font-family:'courier new', courier, monospace;">w=int(input("Width    (m):"))</span></p>
<p><span style="font-family:'courier new', courier, monospace;">p=int(input("Cost ($/m^2):"))</span></p>
<p><span style="font-family:'courier new', courier, monospace;">print("The total cost for this job: $" + str(h*w*p+20))</span></p>
<p> </p></body></html>

Is there any way I can remove the mark-up in batch so that all that is left is:

print("Bob and Bill Tiling Solutions Inc.")
h=int(input("Height   (m):"))
w=int(input("Width    (m):"))
p=int(input("Cost ($/m^2):"))
print("The total cost for this job: $" + str(h*w*p+20))

If there is a third-party utility that does this I would be happy to download it.

I have tried using regular expressions through findstr, to no avail. (My search string is "<[^>]*>", but I do not know how to use findstr to remove all matches from the text file.)

Any suggestions are welcome.


Solution

  • Here's a sed script (I use GNU sed) which I adapted from Eric Pement's sed one-liners:

    The sed command line:

    sed -f dehtml.sed yourfilename
    

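    The file dehtml.sed:

    # Delete every complete tag in the pattern space. If a "<" is still left,
    # the tag continues on the next line: append that line (N) and loop to :a.
    :a
    s/<[^>]*>//g;/</N;//ba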
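
    Since the submissions are Python anyway, the same idea can be sketched in Python itself. This is only a minimal sketch, not a tested marking tool: the names strip_markup and strip_directory, the "*.py" file pattern, and the UTF-8 encoding are assumptions for illustration. It reuses the "<[^>]*>" pattern from the question, then decodes entities such as &lt; (which the sed script leaves as-is) and trims the whitespace-only line left by the empty <p> </p>.

```python
import html
import re
from pathlib import Path

# Same pattern as in the question; [^>] also matches newlines,
# so tags that span several lines are removed as well.
TAG_RE = re.compile(r"<[^>]*>")


def strip_markup(text: str) -> str:
    """Remove HTML tags, then decode entities such as &lt; and &nbsp;."""
    # Tags are removed before decoding, so an escaped "<" in the student's
    # code (&lt;) is never mistaken for the start of a tag.
    without_tags = TAG_RE.sub("", text)
    decoded = html.unescape(without_tags).replace("\u00a0", " ")
    # Trim trailing whitespace per line, then drop blank lines at either end.
    lines = [line.rstrip() for line in decoded.splitlines()]
    return "\n".join(lines).strip("\n") + "\n"


def strip_directory(src: Path, dst: Path, pattern: str = "*.py") -> None:
    """Write a cleaned copy of every matching file in src into dst.

    The pattern and the UTF-8 encoding are assumptions; adjust them to
    whatever the download actually produces.
    """
    dst.mkdir(parents=True, exist_ok=True)
    for path in sorted(src.glob(pattern)):
        cleaned = strip_markup(path.read_text(encoding="utf-8"))
        (dst / path.name).write_text(cleaned, encoding="utf-8")
```

    For example, strip_directory(Path("submissions"), Path("cleaned")) would leave the originals untouched and write the stripped copies to a separate folder (both folder names are hypothetical). Files that were submitted without markup pass through essentially unchanged, unless their code contains both a "<" and a later ">", which the pattern would treat as a tag.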