
String goes from normal characters to garbage for an as of yet indiscernible reason


For my own personal learning I'm trying to write a JSON parser in C. Currently I am having some trouble with the lexer. Everything works the way I want apart from the STRING token. For whatever reason, when the tokens are printed out, the token image (the actual string) for the STRING token is complete garbage. I've included the printed tokens below. The mess of symbols is supposed to say "nullvalue". I added the size of the string and the pointer address to the output while I was trying to figure out what was going on. The size is also a garbage value for some reason.

Type: CURLYOPEN
Image: <NULL>
Line number: 1
|
V
Type: STRING
Image: ����UH��}��}�wQ�E�H�� of size 32765 at 0x7ffdfd6cea70
Line number: 2
|
V
Type: COLON
Image: <NULL>
Line number: 2
|
V
Type: NULL
Image: <NULL>
Line number: 2
|
V
Type: CURLYCLOSED
Image: <NULL>
Line number: 3
|
V
<NULL>

This is the JSON I'm trying to tokenise

{
    "nullvalue" : null
}

This is the code where the STRING token is produced

...
case '\"':
            AADString stringimage = new_aadstring();
            if (!get_stringimage(fd, &stringimage)) return handle_lexerror(E_UNEXPECTEDTERMINATION_ERROR_LEXER_AJSON, linenr);
            new_token(&token, E_STRING_TOKEN_AJSON, &stringimage, linenr);
            break;
...

This is the function where the string image is retrieved

int get_stringimage(FILE *fd, AADString *stringimage)
{
    int c; // int rather than char: fgetc() returns an int so that EOF can be told apart from real characters
    while ((c = fgetc(fd)) != '\"')
    {
        if (c == EOF)
        {
            free_aadstring(stringimage);
            return 0;
        }
        if (c == '\\')
        {
            appendto_aadstring(c, stringimage);
            c = fgetc(fd);
            if (c == EOF)
            {
                free_aadstring(stringimage);
                return 0;
            }
        }
        appendto_aadstring(c, stringimage);
    }
    appendto_aadstring('\0', stringimage);
    return 1;
}

This is where I create a new token and move to the next empty one

void new_token(T_TokenAJSON **token, E_TypeTokenAJSON type, AADString *image, int linenr)
{
    (*token)->type = type;
    (*token)->image = image;
    (*token)->linenr = linenr;
    (*token)->next = (T_TokenAJSON *) malloc(sizeof(T_TokenAJSON));
    (*token)->next->next = NULL;
    (*token) = (*token)->next;
}

These are the relevant token struct and enum

typedef enum 
{
    E_CURLYOPEN_TOKEN_AJSON,
    E_CURLYCLOSED_TOKEN_AJSON,
    E_SQUAREOPEN_TOKEN_AJSON,
    E_SQUARECLOSED_TOKEN_AJSON,
    E_COLON_TOKEN_AJSON,
    E_COMMA_TOKEN_AJSON,
    E_STRING_TOKEN_AJSON,
    E_NUMBER_TOKEN_AJSON,
    E_TRUE_TOKEN_AJSON,
    E_FALSE_TOKEN_AJSON,
    E_NULL_TOKEN_AJSON
} E_TypeTokenAJSON;

typedef struct T_TokenAJSON
{
    E_TypeTokenAJSON type;
    AADString *image;
    int linenr;
    struct T_TokenAJSON *next;
} T_TokenAJSON;

This is how the tokens are initialised

T_TokenAJSON *roottoken = (T_TokenAJSON *) malloc(sizeof(T_TokenAJSON));
roottoken->next = NULL;

This is then passed into the function where the lexing happens. That function is where the case block from earlier comes from.

This is the string struct I use and its relevant functions

typedef struct AADString
{
    int size;
    int nextidx;
    char *content;
} AADString;

AADString new_aadstring()
{
    AADString string;
    string.size = INIT_SIZE; // INIT_SIZE = 2
    string.nextidx = 0;
    string.content = (char *) malloc(INIT_SIZE*sizeof(char));
    return string;
}

void resize_aadstring(AADString *string)
{
    if (string->nextidx == string->size)
    {
        string->size *= 2;
        string->content = (char *) realloc(string->content, string->size*sizeof(char));
    }
}

void appendto_aadstring(char c, AADString *string)
{
    resize_aadstring(string);
    string->content[string->nextidx] = c;
    string->nextidx++;
}

void free_aadstring(AADString *string)
{
    free(string->content);
}

So far:

  • I tried to see if the memory wasn't being allocated for the AADString but that seemed to be fine.
  • I made sure that the address stays the same from when the string is made to when it is stored in the token and it seems to be fine.
  • I checked to see if the content in AADString could be printed after get_stringimage() and that was fine.
  • I also checked to see if it could be printed after being thrown in the token while still in new_token().

Thinking about it now, I suspect something goes wrong when moving to the next token, but if that is the case, I do not know what it could be.

I am not a great C programmer, but I feel I have some grasp of what's going on in general with the language. Right now, though, I am stumped.

Any assistance would be greatly appreciated.


Solution

  • I have managed to fix the problem. After changing new_aadstring() to return a pointer to an AADString instead, and making small adjustments to the rest of the code to match, the garbage characters have disappeared. My thanks to user @Barmar for the suggestion. Admittedly I am still not sure why that fixed the issue. My understanding of C's deep magicks needs some work, I suppose.
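
As for why the change helps: in the original case block, stringimage is a local variable with automatic storage, and new_token() stores its address (&stringimage) in the token. Once execution leaves that block the variable's lifetime is over, so the token is left holding a dangling pointer into stack memory that later calls are free to overwrite. That would explain why the address never changes (0x7ffd... looks like a typical stack address) while both the content and the size turn into garbage. Allocating the AADString itself on the heap keeps it alive until it is explicitly freed. The following is a minimal sketch of that idea, not the poster's actual final code. Two things in it are assumptions of this sketch: appendto_aadstring() returns an int so allocation failures can be reported, and aadstring_from_cstr() is a hypothetical stand-in for get_stringimage() that does not need a FILE.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define INIT_SIZE 2

typedef struct AADString
{
    int size;
    int nextidx;
    char *content;
} AADString;

// Heap-allocate the struct itself so it outlives the block that creates it.
AADString *new_aadstring(void)
{
    AADString *string = malloc(sizeof *string);
    if (string == NULL) return NULL;
    string->size = INIT_SIZE;
    string->nextidx = 0;
    string->content = malloc(INIT_SIZE);
    if (string->content == NULL)
    {
        free(string);
        return NULL;
    }
    return string;
}

// Assumption of this sketch: returns 1 on success, 0 if realloc fails
// (the original returns void). On failure the string is left unchanged.
int appendto_aadstring(char c, AADString *string)
{
    assert(string != NULL);
    if (string->nextidx == string->size)
    {
        char *grown = realloc(string->content, (size_t)string->size * 2);
        if (grown == NULL) return 0;
        string->content = grown;
        string->size *= 2;
    }
    string->content[string->nextidx++] = c;
    return 1;
}

// Frees the character buffer and the struct; both now live on the heap.
void free_aadstring(AADString *string)
{
    if (string == NULL) return;
    free(string->content);
    free(string);
}

// Hypothetical helper standing in for get_stringimage(): builds the string
// inside its own scope and returns a pointer that stays valid afterwards.
AADString *aadstring_from_cstr(const char *text)
{
    AADString *string = new_aadstring();
    if (string == NULL) return NULL;
    size_t len = strlen(text);
    for (size_t i = 0; i <= len; i++) // <= also copies the terminating '\0'
    {
        if (!appendto_aadstring(text[i], string))
        {
            free_aadstring(string);
            return NULL;
        }
    }
    return string;
}
```

With this shape, the case block would declare AADString *stringimage = new_aadstring(); and pass stringimage (without the &) to both get_stringimage() and new_token(). Another option, if the token struct can be changed, is to give T_TokenAJSON an AADString member by value and copy the struct into it; the copy then lives as long as the token does.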