I am automating the marking procedure for a Python class. However, when I download the submissions, some of them include HTML markup, because the students inadvertently submitted their solutions as HTML, for example:
<!DOCTYPE html><html><head><meta charset="UTF-8"></head><body><p><span style="font-family:'courier new', courier, monospace;">print("Bob and Bill Tiling Solutions Inc.")</span></p>
<p><span style="font-family:'courier new', courier, monospace;">h=int(input("Height (m):"))</span></p>
<p><span style="font-family:'courier new', courier, monospace;">w=int(input("Width (m):"))</span></p>
<p><span style="font-family:'courier new', courier, monospace;">p=int(input("Cost ($/m^2):"))</span></p>
<p><span style="font-family:'courier new', courier, monospace;">print("The total cost for this job: $" + str(h*w*p+20))</span></p>
<p> </p></body></html>
Is there any way I can remove the markup in batch so that all that is left is:
print("Bob and Bill Tiling Solutions Inc.")
h=int(input("Height (m):"))
w=int(input("Width (m):"))
p=int(input("Cost ($/m^2):"))
print("The total cost for this job: $" + str(h*w*p+20))
If there is a third-party utility that does this, I would be happy to download it.
I have tried using regular expressions through findstr, to no avail. (My search string is "<[^>]*>", but I do not know how to use findstr to remove all matches from the text file.)
Any suggestions are welcome.
Here's a sed script (I use GNU sed) which I adapted from Eric Pement's sed one-liners.
The command line:
sed -f dehtml.sed yourfilename
The file dehtml.sed:
:a
s/<[^>]*>//g;/</N;//ba
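Since the marking is for a Python class, the same idea can also be sketched in Python for batch use. This is only a rough sketch, not a tested tool: the function names strip_markup and strip_folder, the "*.py" file pattern, and the choice to write cleaned copies into a separate folder are all assumptions made for illustration. It uses the same <[^>]*> pattern as the sed script, additionally decodes entities such as &lt; or &quot;, and drops lines that are left blank.

```python
import html
import re
from pathlib import Path

# Same pattern as the sed script. In Python, [^>] also matches newlines,
# so a tag that is split across two lines is removed as well.
TAG_RE = re.compile(r"<[^>]*>")


def strip_markup(text: str) -> str:
    """Remove tags, decode HTML entities, and drop lines left blank."""
    without_tags = TAG_RE.sub("", text)
    # Decode entities only after the tags are gone, so that a decoded
    # "<" from "&lt;" is not mistaken for the start of a tag.
    decoded = html.unescape(without_tags)
    lines = [line.rstrip() for line in decoded.splitlines()]
    return "\n".join(line for line in lines if line.strip()) + "\n"


def strip_folder(src: str, dst: str, pattern: str = "*.py") -> None:
    """Write a cleaned copy of every matching file in src into dst.

    The folder layout and the default pattern are hypothetical; adjust
    them to however the submissions are actually downloaded.
    """
    out_dir = Path(dst)
    out_dir.mkdir(parents=True, exist_ok=True)
    for path in Path(src).glob(pattern):
        cleaned = strip_markup(path.read_text(encoding="utf-8"))
        (out_dir / path.name).write_text(cleaned, encoding="utf-8")
```

A call such as strip_folder("submissions", "cleaned") would then leave the originals untouched. Like the sed script, this is a plain regex and not a real HTML parser, so a plain-text submission containing a literal "<" followed later by ">" (for example in a comparison) would lose the text in between.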