
How to decode HTML entities in C?


I'm trying to decode numeric HTML entities (in the format &#39;) in C.

So far I've got some code to try and decode them but it seems to produce odd output.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char* convertHtmlEntities(char* str) {
    size_t length = strlen(str);
    size_t i;
    char *endchar = malloc(sizeof(char));
    long charCode;
    if (!endchar) {
        fprintf(stderr,"not enough memory");
        exit(EXIT_FAILURE);
    }
    for (i=0;i<length;i++) {
        if (*(str+i) == '&' && *(str+i+1) == '#' && *(str+i+2) >= '0' && *(str+i+2) <= '9' && *(str+i+3) >= '0' && *(str+i+3) <= '9' && *(str+i+4) == ';') {
            charCode = strtol(str+i+2,&endchar,0);
            printf("ascii %li\n",charCode);
            *(str+i) = charCode;
            strncpy(str+i+1,str+i+5,length - (i+5));
            *(str + length - 5) = 0; /* null terminate string */
        }
    }
    return str;
}

int main()
{
    char string[] = "Helloworld&#39;s parent company has changed - comF";
    printf("%s",convertHtmlEntities(&string));
}

I'm not sure the main function is correct, because I wrote it just for this example; my real program builds the string from a web page. The idea is the same, though.

The function does replace the &#39; with an apostrophe, but the output is garbled just after the replacement and at the end of the string.

Does anyone have a solution?


Solution

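  • strncpy (like strcpy) has undefined behavior when the source and destination overlap.

    Your source str+i+5 and destination str+i+1 overlap, so the copy can corrupt the tail of the string. Don't do that!

    Replace strncpy with memmove, which is specified to handle overlapping regions: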

                *(str+i) = charCode;
                memmove(str+i+1,str+i+5,length - (i+5) + 1); /* also copy the '\0' */
                /* strncpy(str+i+1,str+i+5,length - (i+5)); */
                /* *(str + length - 5) = 0; */ /* null terminate string */
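
For reference, here is a minimal sketch of the whole function with the memmove fix applied. It goes slightly beyond the change above, and these extras are assumptions rather than part of the original answer: length is reduced after each replacement so the loop does not scan stale bytes past the new terminator, strtol is called with base 10 (base 0 would read a leading zero as octal), and the malloc'd endchar is dropped because strtol accepts NULL for its end pointer. Like the question's code, it only handles two-digit decimal entities.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Decode two-digit decimal entities such as &#39; in place.
   Assumes str points to a writable, null-terminated buffer. */
char *convertHtmlEntities(char *str) {
    size_t length = strlen(str);
    size_t i;

    for (i = 0; i < length; i++) {
        /* && short-circuits, so we never read past the '\0' */
        if (str[i] == '&' && str[i + 1] == '#'
            && str[i + 2] >= '0' && str[i + 2] <= '9'
            && str[i + 3] >= '0' && str[i + 3] <= '9'
            && str[i + 4] == ';') {
            /* base 10 so that a leading zero is not parsed as octal */
            long charCode = strtol(str + i + 2, NULL, 10);

            str[i] = (char)charCode;
            /* shift the tail left by 4 bytes, including the '\0';
               memmove is safe for overlapping regions */
            memmove(str + i + 1, str + i + 5, length - (i + 5) + 1);
            length -= 4; /* the string is now 4 characters shorter */
        }
    }
    return str;
}
```

To try it on the string from the question, keep the text in a writable array (char string[] = "...";) and pass string itself rather than &string, for example printf("%s\n", convertHtmlEntities(string));. That call should print Helloworld's parent company has changed - comF.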