Tags: arrays, c, string, computer-science, tokenize

How to properly tokenize strings in C without random symbols?


I'm currently learning C and trying to write a function that tokenizes a paragraph/string delimited by spaces and returns an array with all the tokens. I'm stuck because I can't figure out why some tokens carry symbols that are not in the original string. Can someone help me figure out what's wrong with my code? Also, I don't want to add any additional libraries or use functions like strtok().

char **tokenizeParagraph(char *paragraph) {
    char *ptr = paragraph;
    char words[MAX_WORDS][MAX_WORDLENGTH];
    int wordIndex = 0;
    int wordLen = 0;

    while (*ptr) {
        wordLen = 0;

        while (*ptr && *ptr != ' ') {
            wordLen++;
            ptr++;
        }

        if (wordLen > 0) {
            strncpy(words[wordIndex], paragraph, wordLen);
            printf("%s\n", words[wordIndex]);
            wordIndex++;
        }

        ptr++;
        paragraph = ptr;
    }
    return words;
}

Here's a demo result:

tokenizeParagraph("I'm currently learning C and trying to write a function to tokenize a paragraph/string delimited by spaces and return an array with all the tokens.");

[Image: Error Demo]

Much appreciated!

Edited:

The dynamic memory methods @Sourav Kannantha B and @Finxx suggested are very helpful. However, since I didn't want to include <stdlib.h>, I moved the array declaration out of the function and passed it in as a parameter, so the array is not destroyed along with the function's stack frame when the function returns.

char words[MAX_WORDS][MAX_CHARS];
void tokenizeParagraph(char words[MAX_WORDS][MAX_CHARS], char *paragraph)
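
For readers who want to see that signature filled in, here is a minimal sketch of the caller-owned-array approach. It is not necessarily what I ended up with: the values of MAX_WORDS and MAX_CHARS, the int return value (the token count), and the const qualifier are assumptions added for illustration. It copies characters by hand, so the function itself needs no headers at all.

```c
#define MAX_WORDS 64   /* assumed capacity; pick values that suit your input */
#define MAX_CHARS 32   /* assumed per-word buffer size, including the '\0' */

/* Splits `paragraph` on spaces into the caller-owned `words` array.
 * Returns the number of tokens stored. Tokens longer than MAX_CHARS - 1
 * characters are truncated; tokens beyond MAX_WORDS are dropped. */
int tokenizeParagraph(char words[MAX_WORDS][MAX_CHARS], const char *paragraph) {
    int wordIndex = 0;
    const char *ptr = paragraph;

    while (*ptr && wordIndex < MAX_WORDS) {
        /* Skip any run of spaces before the next word. */
        while (*ptr == ' ') {
            ptr++;
        }

        /* Copy characters by hand, leaving room for the terminator. */
        int len = 0;
        while (*ptr && *ptr != ' ') {
            if (len < MAX_CHARS - 1) {
                words[wordIndex][len++] = *ptr;
            }
            ptr++;
        }

        if (len > 0) {
            words[wordIndex][len] = '\0';  /* terminate explicitly */
            wordIndex++;
        }
    }
    return wordIndex;
}
```

The caller declares `char words[MAX_WORDS][MAX_CHARS];` at file scope (or as a static or local array that outlives the call), passes it in, and then reads the first `n` entries, where `n` is the returned count.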

Solution

  • What @Finxx already suggested is good enough. But you can still improve on it if wordLen varies widely, by allocating each word separately, sized to its own length:

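    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* MAX_WORDS is assumed to be defined elsewhere, as in the question. */

    /* Returns a malloc'd array of MAX_WORDS pointers: one malloc'd string per
       token, with unused slots set to NULL. Returns NULL if allocation fails. */
    char **tokenizeParagraph(char *paragraph) {
        char *ptr = paragraph;
        char **words = malloc(sizeof(char *) * MAX_WORDS);
        int wordIndex = 0;
        int wordLen;

        if (words == NULL) {
            return NULL;
        }

        /* Stop at the end of the string or when the array is full. */
        while (*ptr && wordIndex < MAX_WORDS) {
            wordLen = 0;

            /* Skip leading spaces. */
            while (*ptr && *ptr == ' ') {
                ptr++;
            }

            paragraph = ptr;  /* start of the current word */

            while (*ptr && *ptr != ' ') {
                wordLen++;
                ptr++;
            }

            if (wordLen > 0) {
                /* Allocate exactly wordLen characters plus the terminator. */
                words[wordIndex] = malloc(sizeof(char) * (wordLen + 1));
                if (words[wordIndex] == NULL) {
                    break;  /* out of memory: remaining slots become NULL below */
                }
                strncpy(words[wordIndex], paragraph, wordLen);
                words[wordIndex][wordLen] = '\0';
                printf("%s\n", words[wordIndex]);
                wordIndex++;
            }
        }

        /* Mark the unused slots so the caller can free every entry safely. */
        for (; wordIndex < MAX_WORDS; wordIndex++) {
            words[wordIndex] = NULL;
        }
        return words;
    }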
    

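    Also, note that strncpy does not add a terminating NUL character when the source is at least as long as the count you pass, which is always the case here because you copy only part of the paragraph. Since words in your code is an uninitialized local array, printf keeps printing whatever bytes follow the copied characters until it happens to hit a zero byte. This is probably the reason for the random characters appearing in the output.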

    Also, don't forget to free the allocated memory in the caller. Unused slots are NULL and free(NULL) is a no-op, so it is safe to loop over all MAX_WORDS entries:

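    int main(void) {
        ...
        char** words = tokenizeParagraph(para);
        if (words == NULL) {
            return 1;  /* allocation failed */
        }
        ...
        /* Free each token, then the array of pointers itself. */
        for(int i = 0; i < MAX_WORDS; i++) {
            free(words[i]);
        }
        free(words);
        ...
        return 0;
    }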
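
To make the strncpy point above concrete, here is a small sketch of a helper that pairs the copy with an explicit terminator. The name copyToken is hypothetical (it is not part of the question or of any library), and it assumes the destination has room for at least len + 1 bytes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies the first `len` characters of `src` into
 * `dst` and always terminates the result.
 * Assumes `dst` has room for at least len + 1 bytes. */
void copyToken(char *dst, const char *src, size_t len) {
    /* strncpy writes exactly `len` bytes. If `src` holds at least `len`
     * characters, none of the bytes written is a '\0'. */
    strncpy(dst, src, len);

    /* So the terminator has to be written by hand. */
    dst[len] = '\0';
}
```

For example, after `strncpy(buf, "hello world", 5)` the byte at `buf[5]` is whatever was there before the call, whereas after `copyToken(buf, "hello world", 5)` the buffer holds the string "hello".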