
Strtok strange behaviour


I'm having some trouble using the strtok function. As an exercise I have to process a text file by removing whitespace, capitalizing the first letter of each word and printing no more than 20 characters per line.

Here is a fragment of my code:

fgets(sentence, SIZE, f1_ptr);
    char *tok_ptr = strtok(sentence, " \n"); //tokenizing each line read
    tok_ptr[0] = toupper(tok_ptr[0]); //initials to capital letters

    int num = 0, i;

    while (!feof(f1_ptr)) {
        while (tok_ptr != NULL) {
            for (i = num; i < strlen(tok_ptr) + num; i++) {
                if (i % 20 == 0 && i != 0) //maximum of 20 char per line
                    fputc('\n', stdout);
                fputc(tok_ptr[i - num], stdout);
            }

            num = i;

            tok_ptr = strtok(NULL, " \n");
            if (tok_ptr != NULL)
                tok_ptr[0] = toupper(tok_ptr[0]);
        }

        fgets(sentence, SIZE + 1, f1_ptr);
        tok_ptr = strtok(sentence, " \n");
        if (tok_ptr != NULL)
            tok_ptr[0] = toupper(tok_ptr[0]);
    }

The text is just a bunch of lines, shown here for reference:

Watch your thoughts ; they become words .
Watch your words ; they become actions .
Watch your actions ; they become habits .
Watch your habits ; they become character .
Watch your character ; it becomes your destiny .

Here is what I obtain in the end:

WatchYourThoughts;Th
eyBecomeWords.WatchY
ourWords;THeyBecomeA
ctions.WatchYourActi
ons;TheyBecomeHabits
.WatchYourHabits;The
yBecomeCharacteR.Wat
chYourCharacter;ItBe
comesYourDEstiny.Lao
-Tze

The final result is mostly correct, but sometimes words are not tokenized correctly (for example one occurrence of "they" in "they become", or "destiny"). So for example "they" is split into "t" and "hey", resulting in THey (DEstiny in the other instance) after the manipulations I made. Is it a bug or am I missing something? Probably my code is not that efficient and some condition may end up being critical...

Thank you for the help, it's not that big of a deal, I just don't understand why such a behaviour is occurring.


Solution

  • You have a large number of errors in your code and you are over-complicating the problem. The most pressing error is the one covered in "Why is while ( !feof (file) ) always wrong?". Why? Trace the execution path within your loop. You attempt to read with fgets(), and then you use sentence without knowing whether EOF was reached, calling tok_ptr = strtok(sentence, " \n"); before you ever get around to checking feof(f1_ptr).

    What happens when you actually reach EOF? That IS "Why while ( !feof (file) ) is always wrong?" Instead, you always want to control your read-loop with the return of the read function you are using, e.g. while (fgets(sentence, SIZE, f1_ptr) != NULL)

    What is it you actually need your code to do?

    The larger question is why are you over-complicating the problem with strtok, and arrays (and fgets() for that matter)? Think about what you need to do:

    1. read each character in the file,
    2. if it is whitespace, ignore it, set the in-word flag false,
    3. if it is non-whitespace: if it is the 1st char in a word, capitalize it; output the char, set the in-word flag true and increment the number of chars output to the current line, and finally
    4. if it is the 20th character output, output a newline and reset the counter to zero.

    The bare-minimum tools you need from your C-toolbox are fgetc(), isspace() and toupper() from ctype.h, a counter for the number of characters output, and a flag to know if the character is the first non-whitespace character after a whitespace.

    Implementing the logic

    That makes the problem very simple. Read a character. Is it whitespace? If so, set your in-word flag false. Otherwise, if your in-word flag is false, capitalize the character; then output it, set your in-word flag true and increment your character count. The last thing you need to do is check whether your character count has reached the limit; if so, output a '\n' and reset your character count to zero. Repeat until you run out of characters.

    You can turn that into code with something similar to the following:

    #include <stdio.h>
    #include <ctype.h>
    
    #define CPL 20      /* chars per-line, if you need a constant, #define one (or more) */
    
    int main (int argc, char **argv) {
        
        int c, in = 0, n = 0;   /* char, in-word flag, no. of chars output in line */
        /* use filename provided as 1st argument (stdin by default) */
        FILE *fp = argc > 1 ? fopen (argv[1], "r") : stdin;
        
        if (!fp) {  /* validate file open for reading */
            perror ("file open failed");
            return 1;
        }
        
        while ((c = fgetc(fp)) != EOF) {            /* read / validate each char in file */
            if (isspace(c))                         /* char is whitespace? */
                in = 0;                             /* set in-word flag false */
            else {  /* otherwise, not whitespace */
                putchar (in ? c : toupper(c));      /* output char, capitalize 1st in word */
                in = 1;                             /* set in-word flag true */
                n++;                                /* increment character count */
            }
            if (n == CPL) {                         /* CPL limit reached? */
                putchar ('\n');                     /* output newline */
                n = 0;                              /* reset cpl counter */
            }
        }
        putchar ('\n');     /* tidy up with newline */
        
        if (fp != stdin)    /* close file if not stdin */
            fclose (fp);
    }
    

    Example Use/Output

    Given your input file stored on my computer in dat/text220.txt, you can produce the output you are looking for with:

    $ ./bin/text220 dat/text220.txt
    WatchYourThoughts;Th
    eyBecomeWords.WatchY
    ourWords;TheyBecomeA
    ctions.WatchYourActi
    ons;TheyBecomeHabits
    .WatchYourHabits;The
    yBecomeCharacter.Wat
    chYourCharacter;ItBe
    comesYourDestiny.
    

    (the executable for the code was compiled to bin/text220; I usually keep separate dat, obj, and bin directories for data, object files and executables to keep my source code directory clean)

    note: by reading from stdin by default if no filename is provided as the first argument to the program, you can use your program to read input directly, e.g.

    $ echo "my dog      has   fleas  -   bummer!" | ./bin/text220
    MyDogHasFleas-Bummer
    !
    

    No fancy string functions required, just a loop, a character, a flag and a counter -- the rest is just arithmetic. It's always worth trying to boil your programming problems down to basic steps and then look around your C-toolbox to find the right tool for each basic step.

    Using strtok

    Don't get me wrong, there is nothing wrong with using strtok and it makes a fairly simple solution in this case -- the point I was making is that for simple character-oriented string-processing, it's often just as simple to loop over the characters in the line. You don't gain any efficiencies using fgets() with an array and strtok(); the read from the file is already placed into a buffer of BUFSIZ (see footnote 1).

    If you did want to use strtok(), you should control your read-loop with the return from fgets() and then tokenize with strtok(), also checking its return at each point: a read-loop with fgets() and a tokenization loop with strtok(). Then you handle first-character capitalization and limit your output to 20 chars per line.

    You could do something like the following:

    #include <stdio.h>
    #include <string.h>
    #include <ctype.h>
    
    #define CPL 20      /* chars per-line, if you need a constant, #define one (or more) */
    #define MAXC 1024
    #define DELIM " \t\r\n"
    
    void putcharCPL (int c, int *n)
    {
        if (*n == CPL) {            /* if n == limit */
            putchar ('\n');         /* output '\n' */
            *n = 0;                 /* reset value at mem address to 0 */
        }
        putchar (c);                /* output character */
        (*n)++;                     /* increment value at mem address */
    }
    
    int main (int argc, char **argv) {
        
        char line[MAXC];    /* buffer to hold each line */
        int n = 0;          /* no. of chars output in line */
        /* use filename provided as 1st argument (stdin by default) */
        FILE *fp = argc > 1 ? fopen (argv[1], "r") : stdin;
        
        if (!fp) {  /* validate file open for reading */
            perror ("file open failed");
            return 1;
        }
        
        while (fgets (line, MAXC, fp))  /* read each line and tokenize line */
            for (char *tok = strtok (line, DELIM); tok; tok = strtok (NULL, DELIM)) {
                putcharCPL (toupper((unsigned char)*tok), &n);  /* 1st char to upper (cast avoids UB for negative char) */
                for (int i = 1; tok[i]; i++)        /* output rest unchanged */
                    putcharCPL (tok[i], &n);
            }
        putchar ('\n');     /* tidy up with newline */
        
        if (fp != stdin)    /* close file if not stdin */
            fclose (fp);
    }
    

    (same output)

    The putcharCPL() function is just a helper that checks if 20 characters have been output and if so outputs a '\n' and resets the counter. It then outputs the current character and increments the counter by one. A pointer to the counter is passed so it can be updated within the function making the updated value available back in main().

    Look things over and let me know if you have further questions.

    footnotes:

    1. Depending on your version of glibc, the constant in the source setting the read-buffer size may be _IO_BUFSIZ. _IO_BUFSIZ was changed to BUFSIZ here: glibc commit 9964a14579e5eef9. For Linux BUFSIZ is defined as 8192 (512 on Windows).