c string token tokenize

Capturing words within spaces and quotation marks?


The idea, as the title says, is to capture words separated by spaces while keeping any text inside quotation marks together as a single token. Here's an example of the input we are dealing with:

Input:

The Brown "Fox Jumps Over" "The Lazy" Dog

Currently my code can only capture words separated by spaces; as many of you know, a basic strtok() is enough for that. Here's my code so far:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <ctype.h>

int main () {
   char command[BUFSIZ];
   char *token;
   fgets(command,BUFSIZ,stdin);
   
   token = strtok(command, " ");

   while( token != NULL ) {
      printf( " %s\n", token );
    
      token = strtok(NULL, " ");
   }
   
   return 0;
}

And as expected, my code prints the following:

Current Output:

The
Brown
"Fox
Jumps
Over"
"The
Lazy"
Dog

But what I actually need is the following output:

The
Brown
Fox Jumps Over
The Lazy
Dog

Any help is welcome, and I thank you in advance. (PS: Only the libraries included above are allowed.)


Solution

  • This program works for your input. It employs a tiny state machine that prevents splitting between quotes. IMO, strtok is pretty limited for anything more complicated than splitting on a single delimiter:

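    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    
    /* print a token between markers so stray spaces are visible */
    void prn(char* str) {
        printf("<< %s >>\n", str);
    }
    
    int main(){
        char command[BUFSIZ];
        char state = 0;      /* 0 = outside quotes, 1 = inside quotes */
        char *start = NULL;  /* beginning of the current token */
        char *cur = NULL;    /* character being examined */
        
        if (fgets(command, BUFSIZ, stdin) == NULL)
            return 1;
        /* drop the trailing newline that fgets keeps */
        command[strcspn(command, "\n")] = 0;
        start = cur = command;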
        
        while (*cur) {
            if (state == 0 && *cur == ' ') {
                /* space outside quotes */
                *cur = 0;
                prn(start);
                start = cur+1;
                cur++;
            } else if (*cur == '"') {
                /* quote found */
                *cur = 0;
                if (state) {
                    /* end quote -- print */
                    prn(start);
                    
                    /* skip past spaces */
                    cur++;
                    while (*cur == ' ')
                        cur++;
                } else {
                    /* in quote, move cursor forward */
                    cur++;
                }
                /* flip state and reset start */
                state ^= 1;
                start = cur;
            } else {
                cur++;
            }
            if (cur - command >= BUFSIZ) {
                fprintf(stderr, "Buffer overrun\n");
                return -1;
            }
        }
        /* print the last string, unless the input ended with a closing quote or a space */
        if (*start)
            prn(start);
        
        return 0;
    }
    
    

    The output:

    ➜ echo -n 'The Brown "Fox Jumps Over" "The Lazy" Dog' |./a.out
    << The >>
    << Brown >>
    << Fox Jumps Over >>
    << The Lazy >>
    << Dog >>
    

    [edit: tidied following feedback, printing delimited to catch any sneaky spaces creeping through]
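
If you need the tokens somewhere other than a print loop, the same idea can be packaged as a strtok-like helper. This is only a sketch, not the code from the answer above: next_token is a hypothetical name, it uses nothing beyond string.h, and it assumes a quote only counts when it starts a token (a quote in the middle of a bare word is kept as an ordinary character). An unterminated quote simply runs to the end of the line.

```c
#include <string.h>

/*
 * next_token (hypothetical helper): returns the next token from *cursor,
 * or NULL when the input is exhausted. The buffer is modified in place,
 * like strtok, but the position lives in *cursor instead of hidden state.
 *
 * Usage sketch:
 *     char *cur = command, *tok;
 *     while ((tok = next_token(&cur)) != NULL)
 *         printf("%s\n", tok);
 */
char *next_token(char **cursor)
{
    char *p = *cursor;
    char *token;

    /* skip separators between tokens */
    p += strspn(p, " \t\r\n");
    if (*p == '\0') {
        *cursor = p;
        return NULL;
    }

    if (*p == '"') {
        /* quoted token: starts after the quote, runs to the closing quote */
        token = ++p;
        p += strcspn(p, "\"");
    } else {
        /* bare token: runs to the next separator */
        token = p;
        p += strcspn(p, " \t\r\n");
    }

    /* terminate the token and step past the delimiter, if there was one */
    if (*p != '\0')
        *p++ = '\0';
    *cursor = p;
    return token;
}
```

Because the separator set includes the newline, the line can be passed straight from fgets without trimming it first. Keeping the position in a caller-owned pointer also means two buffers can be tokenized at the same time, which strtok's hidden state does not allow.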