Replace HTML escape sequence with its single character equivalent in C

My program loads news articles from the web, so I end up with an array of HTML documents representing these articles. I need to parse them and show only the relevant content on the screen. That includes converting all HTML escape sequences into readable symbols, so I need a function similar to unEscape in JavaScript.

I know there are libraries in C to parse HTML. But is there some easy way to convert HTML escape sequences like &amp;amp; or &amp;#33; to just & and !?

Solution

This is something that you typically wouldn't use C for. I would have used Python. Here are two questions that could be a good start:

What's the easiest way to escape HTML in Python?

How do you call Python code from C code?

But apart from that, the solution is to write a proper parser. There are lots of resources out there on that topic, but basically you could do something like this:

parseFile()
    while not EOF
        ch = readNextCharacter()
        if ch == '\'
            // skip the character that follows a backslash
            readNextCharacter()
        elseif ch == '&'
            readEscapeSequence()
        else
            output += ch

readEscapeSequence()
    seq = ""
    ch = readNextCharacter()
    // stop at EOF as well, so an unterminated '&' cannot loop forever
    while ch != ';' and not EOF
        seq += ch
        ch = readNextCharacter()
    replace = lookupEscape(seq)
    output += replace

Note that this is only pseudocode to get you started.
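
For reference, here is a minimal sketch of the same idea in plain C. It rests on several assumptions that are mine, not part of the question: the text is a NUL-terminated, writable buffer, the output should be UTF-8, and only a handful of named entities are needed. The names html_unescape, lookup_escape, utf8_encode, ENTITIES and MAX_ENTITY_LEN are all made up for this example, and the table would have to be extended to cover the full HTML entity list. Unlike the pseudocode, it does not treat the backslash specially, since HTML itself has no backslash escapes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_ENTITY_LEN 32  /* assumed upper bound on the text between '&' and ';' */

/* Hypothetical lookup table: a tiny subset of the named entities. */
struct entity { const char *name; const char *text; };
static const struct entity ENTITIES[] = {
    { "amp", "&" }, { "lt", "<" }, { "gt", ">" },
    { "quot", "\"" }, { "apos", "'" }, { "nbsp", "\xC2\xA0" },
};

/* Encode a code point as UTF-8. Returns bytes written, 0 if invalid. */
static size_t utf8_encode(unsigned long cp, char *out)
{
    if (cp == 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    if (cp < 0x80) { out[0] = (char)cp; return 1; }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (char)(0xF0 | (cp >> 18));
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Decode the text between '&' and ';' (seq, len bytes) into out (>= 4 bytes).
   Returns bytes written, or 0 if the sequence is not recognised. */
static size_t lookup_escape(const char *seq, size_t len, char *out)
{
    size_t i;
    if (len >= 2 && seq[0] == '#') {            /* numeric: #33 or #x21 */
        int hex = (seq[1] == 'x' || seq[1] == 'X');
        unsigned long cp = 0;
        i = hex ? 2 : 1;
        if (i == len) return 0;                 /* "&#x;" has no digits */
        for (; i < len; i++) {
            char c = seq[i];
            unsigned d;
            if (c >= '0' && c <= '9') d = (unsigned)(c - '0');
            else if (hex && c >= 'a' && c <= 'f') d = (unsigned)(c - 'a') + 10;
            else if (hex && c >= 'A' && c <= 'F') d = (unsigned)(c - 'A') + 10;
            else return 0;
            cp = cp * (hex ? 16 : 10) + d;
            if (cp > 0x10FFFF) return 0;        /* also prevents overflow */
        }
        return utf8_encode(cp, out);
    }
    for (i = 0; i < sizeof ENTITIES / sizeof ENTITIES[0]; i++) {
        if (strlen(ENTITIES[i].name) == len &&
            memcmp(ENTITIES[i].name, seq, len) == 0) {
            size_t n = strlen(ENTITIES[i].text);
            memcpy(out, ENTITIES[i].text, n);
            return n;
        }
    }
    return 0;
}

/* Replace recognised escape sequences in s in place. This is safe because
   every decoded sequence is no longer than the text it replaces.
   Unknown or unterminated sequences are copied through unchanged. */
void html_unescape(char *s)
{
    const char *src = s;
    char *dst = s;
    assert(s != NULL);
    while (*src) {
        if (*src == '&') {
            char buf[4];
            size_t len = 0, n = 0;
            while (len < MAX_ENTITY_LEN && src[1 + len] && src[1 + len] != ';')
                len++;
            if (src[1 + len] == ';')
                n = lookup_escape(src + 1, len, buf);
            if (n > 0) {
                memcpy(dst, buf, n);
                dst += n;
                src += len + 2;                 /* skip '&', body and ';' */
                continue;
            }
        }
        *dst++ = *src++;
    }
    *dst = '\0';
}
```

Calling html_unescape on a writable char array holding "Tom &amp;amp; Jerry &amp;#33;" should leave "Tom & Jerry !" in it. Do not pass a string literal directly, because the function modifies its argument. Decoding in place avoids any allocation; if the original text must be kept, copy it first.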