Tags: c, parsing, compiler-construction, flex-lexer

Why is malloc() allocating 2 more bytes than it's supposed to?


I'm writing a C- compiler. Flex recognizes my string token and passes it to a function that stores it in a struct containing information about it. First, though, the string needs its escape characters removed; the escape character is a backslash ('\'). Here is my code that does that:

char* removeEscapeChars(char* svalue)
{
    char* processedString; //will be the string with escape characters removed
    int svalLen = strlen(svalue);
    printf("svalLen (size of string passed in): %d\n", svalLen);
    printf("svalue (string passed in): %s\n", svalue);
    int foundEscapedChars = 0;
    for (int i = 0; i < svalLen;) 
    {
        if (svalue[i] == '\\') {
            //Found escaped character
            if (svalue[i+1] == 'n') {
                //Found newline character
                svalue[i] = int('\n');
            }
            else if (svalue[i+1] == '0') {
                //Found null character
                svalue[i] = int('\0');
            }
            else {
                //Any other character
                svalue[i] = svalue[i+1];
            }
            i++;
            foundEscapedChars++;
            for (int j = i; j < svalLen + 1; j++) {
                svalue[j] = svalue[j+1];
            }
        }
        else {
            i++;
        }
    }
    int newSize = svalLen - foundEscapedChars;
    processedString = (char*) malloc(newSize * sizeof(char));
    memcpy(processedString, svalue, newSize * sizeof(char));
    printf("newSize: %d\n", newSize);
    printf("processedString: %s\n", processedString);
    printf("processedString Size: %d\n", strlen(processedString));
    
    free(svalue);
    return processedString;
}

It works 99% of the time, but when it's tested on this specific string (or a similar one with 40 characters), "-//W3C//DTD XHTML 1.0 Transitional//EN", malloc() appears to allocate memory for a string 2 bytes too large. The output is below. Notice that I passed int newSize to malloc(), which is reported as 40, and yet strlen() returns 42. sizeof(char) is also 1. The main issue is that it's inserting garbage characters at the end of the string. What gives?

"-//W3C//DTD XHTML 1.0 Transitional//EN"
svalLen (size of string passed in): 40
svalue (string passed in) "-//W3C//DTD XHTML 1.0 Transitional//EN"
newSize: 40
processedString: "-//W3C//DTD XHTML 1.0 Transitional//EN"Z
processedString Size: 42
Line 47 Token: STRINGCONST Value: "-//W3C//DTD XHTML 1.0 Transitional//EN"Z Len: 40 Input: "-//W3C//DTD XHTML 1.0 Transitional//EN"

Solution

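  • malloc() isn't over-allocating. The buffer has no room for the terminating NUL byte and memcpy() never copies one, so printf("%s") and strlen() keep reading past the end of the allocation until they happen to hit a zero byte. Here's a reworking of your code that takes a different, more conventional approach to processing strings. Start with a function that counts escape characters, as this will be useful in the next step: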

    int escapeCount(char* str) {
        int c = 0;
    
        // Can just increment and work through the string using the given pointer
        while (*str) {
            // Backslash something here
            if (*str == '\\') {
                ++str;
                ++c;
            }
    
            if (*str) {
              // Handle unmatched \ at end of string
              ++str;
            }
        }
    
        return c;
    }
    

    Now using that information you can allocate the correct buffer size:

    char* removeEscapeChars(char* str)
    {
        // IMPORTANT: Allocate strlen() + 1 for the NUL byte not counted
        char* result = malloc(strlen(str) - escapeCount(str) + 1);
        char* r = result;
    
        if (!result) {
            // Allocation failed
            return NULL;
        }
    
        while (*str) {
            if (*str == '\\') {
                ++str;
    
                if (!*str) {
                    // Unmatched \ at end of string: drop it
                    break;
                }
    
                switch (*str) {
                    case 'n':
                        *r = '\n';
                        break;
                    case 'r':
                        *r = '\r';
                        break;
                    case 't':
                        *r = '\t';
                        break;
                    default:
                        *r = *str;
                        break;
                }
            }
            else {
                *r = *str;
            }
    
            ++str;
            ++r;
        }
    
        // The loop stops before copying the terminator, so write it explicitly
        *r = '\0';
    
        return result;
    }
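
If you would rather keep the structure of the original function, the smallest fix is probably to reserve one extra byte and write the terminator yourself after memcpy(). A minimal sketch of that idea, assuming a helper named copyTerminated (a hypothetical name, not part of the question's code):

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not from the original post: returns a heap copy of
 * the first len bytes of src as a properly terminated C string, or NULL
 * if the allocation fails. The caller owns the result and must free() it. */
char *copyTerminated(const char *src, size_t len)
{
    /* len characters plus one byte for the terminating NUL */
    char *dst = malloc(len + 1);
    if (dst == NULL) {
        return NULL;
    }

    memcpy(dst, src, len);

    /* memcpy() copied exactly len bytes, so terminate explicitly */
    dst[len] = '\0';
    return dst;
}
```

In the question's code this would stand in for the malloc()/memcpy() pair, called with newSize as the length. With the 40-character string from the question, strlen() on the result should then report 40, because the terminator sits inside the allocation instead of somewhere past it.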