
Splitting input string into words and saving to an array


I have been going at this for hours, and I can't figure out what the problem in my code is. I'm currently writing a very simple assembler for a custom instruction set architecture. This assembler takes an input file and simply parses line by line. In the parsing process, I intend to split each line up by spaces, writing the tokens to an array for processing. Below is some of the code to do that:

    char** tokens = (char**) malloc(sizeof(char*));
    char* linecpy = strcpy(linecpy, line);

    char* tok_ptr = strtok(linecpy, " ");
    int tokenid = 0;
    while(tok_ptr) {
        tokens = (char**) realloc(tokens, (tokenid+1) * sizeof(char*));
        tokens[tokenid] = tok_ptr;
        tokenid++;
        tok_ptr = strtok(NULL, " ");
    }

To test that this is accurately working, I'm having it print out each token sequentially from the array, and I'm finding random splits in the middle of the tokens that shouldn't be there. Here is an example:

Line from assembly file:

    jsr fibloop      ; jump to the main program loop

Expected Output from splitting by spaces:

    jsr
    fibloop
    ;
    jump
    to
    the
    main
    program
    loop

Actual Output:

    jsr
    fibloo
    p
    ;
    jump
    to
    the
    main
    pr
    ogram
    l
    oop

I've spent a long time trying to solve this to no avail; any feedback on how to fix it would be greatly appreciated.

EDIT: The solution was pointed out by Clifford and 4386427. The problem was that linecpy had no memory allocated to it: strcpy copies into an existing buffer and returns its destination pointer, not a newly allocated string as I had incorrectly assumed. The working code is below. I've also included a comment filter, suggested by Clifford, that stops tokenization once the parser hits a comment character.

    char** tokens = (char**) malloc(sizeof(char*));
    char* linecpy = malloc(strlen(line) + 1);

    strcpy(linecpy, line);

    char* tok_ptr = strtok(linecpy, " ");
    int tokenid = 0;
    while(tok_ptr) {
        /*
            If a token starts with a comment character then we stop tokenization,
            as everything after will be commented and is of no use to the parser
        */
        if(tok_ptr[0] == ';') break;
        tokens = (char**) realloc(tokens, (tokenid+1) * sizeof(char*));
        tokens[tokenid] = tok_ptr;
        tokenid++;
        tok_ptr = strtok(NULL, " ");
    }
    // free memory after parsing; the tokens point into linecpy,
    // so neither may be used once these are freed
    free(tokens);
    free(linecpy);

Hopefully this helps anyone with the same problem I had. The quick responses given by members of this community were extremely helpful. Thanks guys!
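
One limitation of the comment filter above is that it only triggers when ';' is the first character of a token, so a line such as `jsr fibloop;comment` would still yield the token `fibloop;comment`. A minimal sketch of one way around this is to cut the copied line at the first ';' before calling strtok. The helper name `strip_comment` is hypothetical, and the sketch assumes ';' never appears inside a literal in this assembly syntax:

```c
#include <string.h>

/* Hypothetical helper: truncate the buffer at the first ';' so that
   strtok never sees the comment text. Modifies the buffer in place,
   so call it on the writable copy (linecpy), not on a string literal. */
static void strip_comment(char *line)
{
    char *semi = strchr(line, ';');
    if (semi != NULL) {
        *semi = '\0';   /* end the string where the comment begins */
    }
}
```

Calling `strip_comment(linecpy);` right after the strcpy would make the `tok_ptr[0] == ';'` check inside the loop redundant, although leaving it in does no harm.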


Solution

  • char* linecpy = strcpy(linecpy, line);
    

    is undefined behavior. linecpy is uninitialized and points to no allocated memory when strcpy writes through it. You need

    char* linecpy = malloc(strlen(line) + 1);
    strcpy(linecpy, line);
    

    Besides that:

    char** tokens = (char**) malloc(sizeof(char*));
    

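    should be the following, because realloc accepts a NULL pointer and then behaves like malloc, which makes the initial allocation unnecessary: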

    char** tokens = NULL;
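
Putting both points together, the tokenizer could look roughly like the sketch below. The function name `split_line` and its out-parameters are hypothetical and not from the original post. It also assigns the result of realloc to a temporary pointer first, so the existing array is not leaked if the call fails:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: split `line` on spaces, stopping at a token that
   starts with ';'. Returns the token count, or -1 on allocation failure.
   The tokens point into *out_copy, so the caller must free both
   *out_tokens and *out_copy, and only after it is done with the tokens. */
static int split_line(const char *line, char ***out_tokens, char **out_copy)
{
    char **tokens = NULL;   /* realloc(NULL, n) behaves like malloc(n) */
    int count = 0;

    char *copy = malloc(strlen(line) + 1);   /* +1 for the terminating '\0' */
    if (copy == NULL) {
        return -1;
    }
    strcpy(copy, line);

    for (char *tok = strtok(copy, " "); tok != NULL; tok = strtok(NULL, " ")) {
        if (tok[0] == ';') {
            break;          /* rest of the line is a comment */
        }
        char **grown = realloc(tokens, (size_t)(count + 1) * sizeof(char *));
        if (grown == NULL) {
            free(tokens);   /* the old block is still valid when realloc fails */
            free(copy);
            return -1;
        }
        tokens = grown;
        tokens[count++] = tok;
    }

    *out_tokens = tokens;   /* stays NULL if the line held no tokens */
    *out_copy = copy;
    return count;
}
```

A caller would declare `char **tokens; char *copy;`, call `split_line(line, &tokens, &copy)`, process the returned number of tokens, and then call `free(tokens)` and `free(copy)`.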