Search code examples
cstringduplicatestokenizestrtok

String Tokenization problem occurs when deleting duplicate words from a string


What I'm trying to do in the following code is to tokenize a string and store every token in a dynamic allocated structure but exclude any duplicates.

This code kind of works, until I enter a string that contains two equal words. For example, the string "this this", will also store the second word even though it's the same. But if I enter "this this is" instead, it removes the second "this", and completely ignores the last word of the string, so that it doesn't get deleted if there's a duplicate in the string.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define dim 70


typedef struct string {
  char* token[25];
} string;

int main() {

  string* New = malloc(dim*sizeof(string));

  char* s;
  char* buffer = NULL;
  int i = 0, r = 0;

  s = malloc(dim * sizeof(char));

  fgets(s, dim, stdin);

  printf("The string is: %s\n", s); 

  New->token[i] = malloc(dim*sizeof(char));
  New->token[i] = strtok(s, " ");
  ++i;

  while((buffer = strtok(NULL, " ")) && buffer != NULL){

    printf("\nbuffer is: %s", buffer);

    for(r = 0; r < i; ++r) {

      if(strcmp(New->token[r], buffer) != 0 && r == i-1) {

        New->token[i] = malloc(strlen(buffer)*sizeof(char)+1);
        New->token[i] = buffer;
        ++i;

      }
      else if(New->token[r] == buffer) {
            break;
      }

    }



  }

printf("\n New string: ");
for(i = 0; New->token[i] != NULL; ++i) {
   printf(" %s", New->token[i]);
}


return 0;
}

In my mind this should work fine but I'm really having a hard time finding what I did wrong here. If you need additional info just ask me please, I apologise for any eventual lack of clarity (and for my english).


Solution

  • Complete re-write of this answer to address some fundamentally wrong things I did not see the first time through. See in-line comments in the code at bottom to explain some of the construct changes:

    I ran your code exactly as is and saw what you are describing, and other than the note about using strcmp in the other answer, found several lines of code that can be adjusted, or removed to make it do what you described it should:

    First, the struct definition creates a pointer to an array of char. Based on what you are doing later in the code, what you need is a simple array of char

    typedef struct string {
      //char* token[25]; //this create a pointer to array of 25 char
      char token[25]; //this is all you need
    } string;
    

    As you will see later, this will greatly simplify memory allocation.

    some basic problems:

    Include the \n newline character in your parsing delimiter. When <enter> is hit as the end of entering the string, a newline is appended, causing the first instance of this and the second instance of this\n to be unequal.

    while((buffer = strtok(NULL, " \n")) && buffer != NULL){
                                   ^^
    

    This line is creating uninitialized memory.

    string* New = malloc(dim*sizeof(string)); 
    

    A note about using malloc() vs. calloc(): malloc() leaves the memory it creates uninitialized, while calloc() creates a block of memory all initialized to 0.

    Memory created using malloc()

    enter image description here

    Memory created using calloc():

    enter image description here

    This becomes important in several places in your code, but in particular I see a problem in the last section:

    for(i = 0; New->token[i] != NULL; ++i) {
       printf(" %s", New->token[i]);
    }
    

    If the memory created for New is not initialized, you can get a run-time error when the index i is incremented beyond the area in memory that you have explicitly written to, and loop attempts to test New->token[i]. If New->token[i] contains anything but 0, it will attempt to print that area of memory.

    You should also free each instance of memory created in your code with a corresponding call to free().

    All of this, and more is addressed in the following re-write of your code: (tested against this is a string a string.)

    typedef struct string {
      //char* token[25]; //this create a pointer to array of 25 char
      char token[25]; //this is all you need
    } string;
    
    int main() {
        char* s;
        char* buffer = NULL;
        int i = 0, r = 0;
    
        string* New = calloc(dim, sizeof(string));//Note: This creates an array of New.
                                                  //Example: New[i]
                                                  //Not: New->token[i]
        s = calloc(dim , sizeof(char));
        fgets(s, dim, stdin);
        printf("The string is: %s\n", s); 
        buffer = strtok(s, " \n");
        strcpy(New[i].token, buffer); //use strcpy instead of = for strings
        //restuctured the parsing loop to a more conventional construct
        // when using strtok:
        if(buffer)
        {
            ++i;
            while(buffer){
                printf("\nbuffer is: %s", buffer);
                for(r = 0; r < i; ++r) {
                    if(strcmp(New[r].token, buffer) != 0 && r == i-1) {
                        strcpy(New[i].token, buffer);
                        ++i;
                    }
                    else if(strcmp(New[r].token, buffer)==0) {
                        break;
                    }
                }
                buffer = strtok(NULL, " \n");
            }
        }
        printf("\n New string: ");
        for(i = 0; i<dim; i++) {
            if(New[i].token) printf(" %s", New[i].token);
        }
        free(New);
        free(s);
        return 0;
    }