Replace HTML escape sequence with its single character equivalent in C


My program loads news articles from the web, so I end up with an array of HTML documents representing these articles. I need to parse them and show only the relevant content on the screen. That includes converting all HTML escape sequences into readable characters, so I need a function similar to unescape in JavaScript.

I know there are C libraries for parsing HTML. But is there an easy way to convert HTML escape sequences like &amp;amp; or &amp;#33; to just & and !?


Solution

  • This is something that you typically wouldn't use C for. I would have used Python. Here are two questions that could be a good start:

    What's the easiest way to escape HTML in Python?

    How do you call Python code from C code?

    But apart from that, the solution is to write a proper parser. There are lots of resources out there on that topic, but basically you could do something like this:

    parseFile()
        while not EOF
            ch = readNextCharacter()
            if ch == '&'
                readEscapeSequence()
            else
                output += ch

    readEscapeSequence()
        seq = ""
        ch = readNextCharacter()
        while ch != ';' and not EOF
            seq += ch
            ch = readNextCharacter()
        replace = lookupEscape(seq)    // e.g. "amp" -> "&", "#33" -> "!"
        output += replace
    

    Note that this is only pseudocode to get you started.
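
For reference, the pseudocode above could be turned into C roughly as follows. This is a minimal sketch, not a complete HTML decoder: the names html_unescape, lookup_escape, utf8_encode and MAX_ENTITY_LEN are made up for illustration, the entity table holds only a handful of the named entities HTML defines, and the output is assumed to be UTF-8. Sequences it does not recognise are copied through unchanged.

```c
#include <assert.h>
#include <string.h>

#define MAX_ENTITY_LEN 32  /* arbitrary cap so a stray '&' never scans far ahead */

/* Hypothetical, deliberately tiny table; extend it with the entities you need. */
static const struct { const char *name, *text; } entities[] = {
    { "amp", "&" }, { "lt", "<" }, { "gt", ">" },
    { "quot", "\"" }, { "apos", "'" }, { "nbsp", "\xC2\xA0" },
};

/* Encode code point cp as UTF-8 into out; returns bytes written, 0 if invalid. */
static size_t utf8_encode(unsigned long cp, char *out)
{
    if (cp == 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (char)(0xF0 | (cp >> 18));
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Decode the text between '&' and ';' (seq, len bytes) into out.
   Returns the number of bytes written, or 0 if it is not recognised. */
static size_t lookup_escape(const char *seq, size_t len, char *out)
{
    size_t i;
    if (len >= 2 && seq[0] == '#') {          /* numeric: #33 or #x21 */
        int hex = (seq[1] == 'x' || seq[1] == 'X');
        unsigned long cp = 0;
        i = hex ? 2 : 1;
        if (i == len)
            return 0;                         /* "#x" with no digits */
        for (; i < len; i++) {
            char c = seq[i];
            unsigned d;
            if (c >= '0' && c <= '9') d = (unsigned)(c - '0');
            else if (hex && c >= 'a' && c <= 'f') d = (unsigned)(c - 'a') + 10;
            else if (hex && c >= 'A' && c <= 'F') d = (unsigned)(c - 'A') + 10;
            else return 0;
            cp = cp * (hex ? 16 : 10) + d;
            if (cp > 0x10FFFF)
                return 0;                     /* out of Unicode range */
        }
        return utf8_encode(cp, out);
    }
    for (i = 0; i < sizeof entities / sizeof entities[0]; i++) {
        if (strlen(entities[i].name) == len &&
            memcmp(entities[i].name, seq, len) == 0) {
            size_t n = strlen(entities[i].text);
            memcpy(out, entities[i].text, n);
            return n;
        }
    }
    return 0;
}

/* Copy src to dst, replacing each recognised "&...;" sequence. dst must hold
   at least strlen(src) + 1 bytes, since a decoded sequence is never longer
   than its escaped form. Returns dst. */
char *html_unescape(const char *src, char *dst)
{
    char *out = dst;
    assert(src != NULL && dst != NULL);
    while (*src != '\0') {
        if (*src == '&') {
            size_t len = 0, n = 0;
            while (len < MAX_ENTITY_LEN && src[1 + len] != '\0' && src[1 + len] != ';')
                len++;
            if (src[1 + len] == ';')
                n = lookup_escape(src + 1, len, out);
            if (n > 0) {
                out += n;
                src += len + 2;               /* skip '&', the body and ';' */
                continue;
            }
        }
        *out++ = *src++;  /* ordinary character, or an '&' we could not decode */
    }
    *out = '\0';
    return dst;
}
```

With a buffer at least as large as the input, html_unescape("Fish &amp;amp; chips&amp;#33;", buf) leaves "Fish & chips!" in buf. For anything beyond a few entities, a generated table of all the named references (or an existing HTML library) is probably a better choice than extending this by hand.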