c string parsing tokenize delimiter

Split a char string on a multi-character delimiter in C


I want to split a char * string on a multi-character delimiter. I know that strtok() can split a string, but it treats every character of its delimiter argument as a separate single-character delimiter.

I want to split the string on a substring such as "abc", or any other substring. How can that be achieved?


Solution

  • Finding the point at which the desired sequence occurs is pretty easy, because strstr does exactly that:

    char str[] = "this is abc a big abc input string abc to split up";
    char *pos = strstr(str, "abc");
    

    So, at that point, pos points to the first occurrence of abc in the larger string (it would be NULL if abc did not occur at all). Here's where things get a little ugly. strtok has a nasty design where it 1) modifies the original string, and 2) stores a pointer to the "current" location in the string internally.
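
    To make the strtok behaviour concrete, here is a small sketch. The helper name count_strtok_tokens is made up for illustration; it relies only on the standard strtok. It shows that a delimiter argument of "abc" is treated as the set of characters a, b and c rather than as one sequence, and that the buffer is overwritten along the way.

```c
#include <string.h>

/* Hypothetical helper: counts the tokens strtok finds in buf.
   buf must be writable, because strtok replaces each delimiter
   character that ends a token with '\0'. */
static int count_strtok_tokens(char *buf, const char *delims) {
    int count = 0;
    for (char *tok = strtok(buf, delims); tok != NULL; tok = strtok(NULL, delims))
        count++;
    return count;
}
```

    Called on a writable buffer holding "xaybzcw" with "abc" as the delimiter argument, this counts four tokens (x, y, z and w), even though the sequence abc never appears in the input. Afterwards the buffer reads as just "x", since the byte after it is now '\0'.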

    If we didn't mind doing roughly the same, we could do something like this:

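    #include <string.h>
    
    /* Like strtok, but splits on the whole delimiter string.
       Modifies input and keeps its position in a static variable. */
    char *multi_tok(char *input, const char *delimiter) {
        static char *string;
        if (input != NULL)
            string = input;
    
        if (string == NULL)
            return string;
    
        /* An empty delimiter would match at every position and never
           advance, so treat it as "no match". */
        char *end = (*delimiter != '\0') ? strstr(string, delimiter) : NULL;
        if (end == NULL) {
            /* Last token: hand it back and mark the input as exhausted. */
            char *temp = string;
            string = NULL;
            return temp;
        }
    
        char *temp = string;
    
        *end = '\0';                        /* terminate the current token */
        string = end + strlen(delimiter);   /* resume after the delimiter */
        return temp;
    }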
    

    This does work. For example:

    #include <stdio.h>
    
    int main(void) {
        char input [] = "this is abc a big abc input string abc to split up";
    
        char *token = multi_tok(input, "abc");
    
        while (token != NULL) {
            printf("%s\n", token);
            token = multi_tok(NULL, "abc");
        }
    }
    

    produces roughly the expected output:

    this is
     a big
     input string
     to split up
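
    The output is only "roughly" as expected because the spaces around each abc stay attached to the tokens. A second difference from strtok is that two delimiters in a row produce an empty token instead of being skipped. The sketch below repeats the static-state multi_tok so that it is self-contained, and adds a helper, split_all, whose name and signature are made up for illustration. It assumes a non-empty delimiter.

```c
#include <stddef.h>
#include <string.h>

/* Same logic as the static-state multi_tok above, repeated here so the
   sketch compiles on its own. Assumes delimiter is not empty. */
static char *multi_tok(char *input, const char *delimiter) {
    static char *string;
    if (input != NULL)
        string = input;
    if (string == NULL)
        return NULL;

    char *temp = string;
    char *end = strstr(string, delimiter);
    if (end == NULL) {
        string = NULL;                      /* last token */
    } else {
        *end = '\0';                        /* terminate the current token */
        string = end + strlen(delimiter);   /* resume after the delimiter */
    }
    return temp;
}

/* Hypothetical helper: splits buf in place and stores up to max token
   pointers in out. Returns the number of tokens stored. */
static size_t split_all(char *buf, const char *delimiter, char **out, size_t max) {
    size_t n = 0;
    for (char *tok = multi_tok(buf, delimiter);
         tok != NULL && n < max;
         tok = multi_tok(NULL, delimiter))
        out[n++] = tok;
    return n;
}
```

    Splitting "one abc two" on "abc" gives the two tokens "one " and " two", spaces included. Splitting "xabcabcy" gives three tokens: "x", an empty string, and "y".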
    

    Nonetheless, it's clumsy, difficult to make thread-safe (you'd have to make its internal string variable thread-local), and generally just a crappy design. Using an interface something like strtok_r (for one example), where the caller owns the saved position, we can fix at least the thread-safety issue:

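    #include <stdio.h>
    #include <string.h>
    
    /* Caller-owned tokenizer state: the position to resume from. */
    typedef char *multi_tok_t;
    
    char *multi_tok(char *input, multi_tok_t *string, const char *delimiter) {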
        if (input != NULL)
            *string = input;
    
        if (*string == NULL)
            return *string;
    
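        /* An empty delimiter would never advance, so treat it as "no match". */
        char *end = (*delimiter != '\0') ? strstr(*string, delimiter) : NULL;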
        if (end == NULL) {
            char *temp = *string;
            *string = NULL;
            return temp;
        }
    
        char *temp = *string;
    
        *end = '\0';
        *string = end + strlen(delimiter);
        return temp;
    }
    
    multi_tok_t init() { return NULL; }
    
    int main(void) {
        multi_tok_t s = init();
    
        char input [] = "this is abc a big abc input string abc to split up";
    
        char *token = multi_tok(input, &s, "abc");
    
        while (token != NULL) {
            printf("%s\n", token);
            token = multi_tok(NULL, &s, "abc");
        }
    }
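
    Besides thread safety, caller-owned state also lets two tokenizations be in progress at once, which the static version cannot do. The sketch below repeats the reentrant multi_tok in slightly condensed form and nests one split inside another. The helper count_fields and its separators are made up for illustration, and non-empty delimiters are assumed.

```c
#include <stddef.h>
#include <string.h>

typedef char *multi_tok_t;

/* Reentrant tokenizer as above, condensed. Assumes delimiter is not empty. */
static char *multi_tok(char *input, multi_tok_t *state, const char *delimiter) {
    if (input != NULL)
        *state = input;
    if (*state == NULL)
        return NULL;

    char *token = *state;
    char *end = strstr(token, delimiter);
    if (end == NULL) {
        *state = NULL;                      /* last token */
    } else {
        *end = '\0';                        /* terminate the current token */
        *state = end + strlen(delimiter);   /* resume after the delimiter */
    }
    return token;
}

/* Hypothetical helper: splits text into records, then each record into
   fields, and returns the total number of fields. The outer and inner
   loops each keep their own state, so they do not disturb each other. */
static size_t count_fields(char *text, const char *record_sep, const char *field_sep) {
    size_t total = 0;
    multi_tok_t outer = NULL;
    for (char *rec = multi_tok(text, &outer, record_sep);
         rec != NULL;
         rec = multi_tok(NULL, &outer, record_sep)) {
        multi_tok_t inner = NULL;
        for (char *field = multi_tok(rec, &inner, field_sep);
             field != NULL;
             field = multi_tok(NULL, &inner, field_sep))
            total++;
    }
    return total;
}
```

    For a writable buffer holding "a::b||c::d::e", with "||" as the record separator and "::" as the field separator, count_fields returns 5: two fields in the first record and three in the second.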
    

    I guess I'll leave it at that for now, though. To get a really clean interface, we'd want to reinvent something like coroutines, and that's probably a bit much to post here.
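
    Short of coroutines, one possible middle ground is an iterator that never writes to the input and reports each token as a start pointer plus a length. This is only a sketch of that idea, not part of the approach above; the names split_iter, split_begin and split_next are made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical iterator state, owned by the caller. */
typedef struct {
    const char *next;     /* start of the next token, or NULL when done */
    const char *delim;    /* delimiter string to split on */
    size_t delim_len;     /* cached strlen(delim) */
} split_iter;

static split_iter split_begin(const char *input, const char *delim) {
    split_iter it = { input, delim, strlen(delim) };
    return it;
}

/* Stores the start and length of the next token and returns 1, or
   returns 0 when no tokens remain. The input is never modified, so the
   tokens are not '\0'-terminated. */
static int split_next(split_iter *it, const char **start, size_t *len) {
    if (it->next == NULL)
        return 0;

    /* An empty delimiter is treated as "no match": one token, whole input. */
    const char *end = it->delim_len ? strstr(it->next, it->delim) : NULL;

    *start = it->next;
    if (end == NULL) {
        *len = strlen(it->next);
        it->next = NULL;                    /* last token */
    } else {
        *len = (size_t)(end - it->next);
        it->next = end + it->delim_len;     /* resume after the delimiter */
    }
    return 1;
}
```

    Because the tokens are not terminated, they can be printed with printf("%.*s\n", (int)len, start). The trade-off is that callers needing ordinary C strings have to copy each token out themselves, but the input can be a string literal or other read-only data.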