My program loads news articles from the web, so I end up with an array of HTML documents representing those articles. I need to parse them and show only the relevant content on screen. That includes converting all HTML escape sequences into readable characters, so I need a function similar to unescape in JavaScript.
I know there are C libraries for parsing HTML.
But is there an easy way to convert HTML escape sequences like &amp;amp; or &amp;#33; to just & and !?
This is something that you typically wouldn't use C for. I would have used Python. Here are two questions that could be a good start:
What's the easiest way to escape HTML in Python?
How do you call Python code from C code?
But apart from that, the solution is to write a proper parser. There are lots of resources out there on that topic, but basically you could do something like this:
    parseFile()
        while not EOF
            ch = readNextCharacter()
            if ch == '&'
                readEscapeSequence()
            else
                output += ch

    readEscapeSequence()
        seq = ""
        ch = readNextCharacter()
        while ch != ';' and not EOF
            seq += ch
            ch = readNextCharacter()
        replace = lookupEscape(seq)
        output += replace
Note that this is only pseudocode to get you started.
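For reference, here is a minimal sketch of how that loop could look in C. It is not a full HTML decoder. The names `html_unescape`, `lookup_escape` and `named_entities` are made up for this example, the entity table holds only five entries (real HTML defines far more), and numeric references are only decoded when they fall in the ASCII range. Anything it does not recognise is copied through unchanged, and backslashes are treated as ordinary characters because HTML has no backslash escapes.

```c
#include <stddef.h>
#include <string.h>

/* Tiny illustrative table; a real decoder needs the full HTML entity list. */
static const struct { const char *name; char ch; } named_entities[] = {
    { "amp", '&' }, { "lt", '<' }, { "gt", '>' }, { "quot", '"' }, { "apos", '\'' },
};

/* Hypothetical helper: resolve the text between '&' and ';'.
   Stores the decoded character in *out and returns 1, or returns 0 if unknown. */
static int lookup_escape(const char *seq, size_t len, char *out)
{
    if (len >= 2 && seq[0] == '#') {
        /* Numeric reference: decimal (&#33;) or hexadecimal (&#x21;). */
        unsigned base = 10;
        size_t i = 1;
        unsigned long code = 0;
        if (seq[1] == 'x' || seq[1] == 'X') { base = 16; i = 2; }
        for (; i < len; i++) {
            char c = seq[i];
            unsigned d;
            if (c >= '0' && c <= '9') d = (unsigned)(c - '0');
            else if (base == 16 && c >= 'a' && c <= 'f') d = (unsigned)(c - 'a') + 10;
            else if (base == 16 && c >= 'A' && c <= 'F') d = (unsigned)(c - 'A') + 10;
            else return 0;
            code = code * base + d;
            if (code > 127) return 0;   /* ASCII only in this sketch */
        }
        if (code == 0) return 0;        /* no digits, or a NUL character */
        *out = (char)code;
        return 1;
    }
    for (size_t i = 0; i < sizeof named_entities / sizeof named_entities[0]; i++) {
        const char *name = named_entities[i].name;
        if (strlen(name) == len && memcmp(name, seq, len) == 0) {
            *out = named_entities[i].ch;
            return 1;
        }
    }
    return 0;
}

/* Decodes entities in place. Every replacement is shorter than the
   sequence it replaces, so the write position never passes the read position. */
void html_unescape(char *s)
{
    const char *src = s;
    char *dst = s;
    while (*src != '\0') {
        if (*src == '&') {
            const char *semi = strchr(src + 1, ';');
            char decoded;
            if (semi != NULL &&
                lookup_escape(src + 1, (size_t)(semi - src - 1), &decoded)) {
                *dst++ = decoded;
                src = semi + 1;
                continue;
            }
        }
        *dst++ = *src++;    /* ordinary character or unrecognised sequence */
    }
    *dst = '\0';
}
```

To use it, pass a writable buffer (not a string literal): calling `html_unescape` on an array holding `Tom &amp;amp; Jerry &amp;#33;` should leave `Tom & Jerry !` in it. Supporting non-ASCII entities would mean emitting UTF-8 (or whatever encoding your display uses) instead of a single `char`.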