Search code examples
cstringpointerslinked-listtokenize

Tokenizing input file into linked list


I'm attempting to tokenize an input file and store its individual words within a linked list organized by word count. I've been struggling with storing the tokenized string into a node, and am struggling to understand what is incorrect in my tokenizing/inserting process. Currently, when printing the stored strings out, the first letter of each string is truncated off, and there is seemingly random garbage and the end of each string. I have tried the following to fix my error:

  1. Null-terminating each string after tokenization (I've left that in my program as it appears to be correct)
  2. Using strncpy() instead of new_word->str = str;
  3. Passing a pointer to the tokenized string to my insert function, instead of just passing the string itself.

Below is my code:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <ctype.h>
#include <strings.h>

typedef struct word{ 
    int length; 
    char *str; 
    struct word *left;
    struct word *right; 
    struct word *down;
}word;


void print_list(word **head){ 

    word *temp_traverse = *head;
    word *temp_down;

    for( ; temp_traverse!=NULL; temp_traverse = temp_traverse->right){ 
        temp_down = temp_traverse;
        for( ; temp_down!=NULL; temp_down = temp_down->down){ 
            printf("Count: %d, String: %s\n", temp_down->length, temp_down->str);
        }
    }

}


int is_empty(word **head, word **tail){ 

    if((*head == NULL)||(*tail == NULL))
        return 1;

    return 0;
}

void insert(word **head, word **tail, word *new_word){ 

    if(is_empty(head, tail)){
        (*head) = new_word; 
        (*tail) = new_word;
        return;
    }

    if((new_word->length)<((*head)->length)){ 
        new_word->right = (*head);
        (*head)->left = new_word;
        (*head) = new_word;
        return;
    }

    word *temp = *head;

    while(((temp->right)!=NULL) && ((temp->length)<(new_word->length))) 
        temp = temp->right;

    if((temp->length) == (new_word->length)){
        while(temp->down != NULL)
            temp = temp->down;
        temp->down = new_word;
        return;
    }

    if(temp->right == NULL){
        word* last = (*tail);
        last->right = new_word;
        new_word->left = last; 
        (*tail) = new_word;
        return;
    }

    word* next = temp->right;
    temp->right = new_word;
    next->left = new_word; 
    new_word->left = temp; 
    new_word->right = next;

    return;
}

void create(word **head, word **tail, char **str){ 

    word *new_word = (word*)malloc(sizeof(word));
    int length = strlen(*str);

    if(new_word == NULL){
            fprintf(stderr, "Error creating a new word node.\n");
            exit(0);
        }

    new_word->str = (char*)malloc(sizeof(*str));
    strncpy(new_word->str, *str, length);
    //new_word->str = *str;
    new_word->length = length;
    printf("%s ", new_word->str); //test print

    new_word->left = NULL;
    new_word->right = NULL;
    new_word->down = NULL;

    insert(head, tail, new_word);

    return;
}


void tokenize(word **head, word **tail, char words_buffer[]){ 

    char *cur; 

    cur = strtok(words_buffer, " .,;()\t\r\v\f\n");

    *cur++ = '\0';
    create(head, tail, &cur);

    /* tokenize the next string and reset the "duplicate" variable */
    while((cur = strtok(NULL, " .,;()\t\r\v\f\n")) != NULL){
        //cur = strtok(NULL, " .,;()\t\r\v\f\n"); 
        *cur++ = '\0';      
        if(cur){
            create(head, tail, &cur);
        }
    }

}

int main(int argc, char *argv[]){ 

    FILE *fp;
    word *head = NULL; 
    word *tail = NULL;

    /*if(argc<3){
        printf("Failure: not enough arguments");
        return -1; 
    }*/

    fp = fopen(argv[1], "r");
    fseek(fp, 0, SEEK_END);
    char words_buffer[ftell(fp)+1];
    fseek(fp, 0, SEEK_SET);

    if(fp==NULL){
        printf("Failure: unreadable file");
        return -1;
    }

    while(fgets(words_buffer, sizeof(words_buffer), fp)){
            if(strlen(words_buffer)>1)
                tokenize(&head, &tail, words_buffer);
    }

    //print_list(&head);

    fclose(fp);
    return 0;
} 

I've left my test string printing for your reference. You will also notice that I don't use print_list right now, as I have yet to store strings correctly.

Because of the garbage at the end, I am assuming I am either incorrectly using the pointer to the string, or am malloc()ing too much space. As for the truncation, I'm not sure, but I presume it is related to my *cur++ = '\0'; line.

Any help is greatly appreciated, thanks for taking the time to take a look.


Solution

  • You are not copying the whole string with your strncpy().

    In fact, you are copying one character too few when you obtain the length with:

    int length = strlen(*str);

    As stated in the strncpy() manpage:

    Warning: If there is no null byte among the first n bytes of src, the string placed in dest will not be null-terminated.

    So make sure that when you use functions operating on null-terminated strings, such as most of the standard library str*() functions, that you account for the '\0' terminator with:

    int length = strlen(*str) + 1;

    Also, as an aside, the void * returned by malloc() is implicitly converted to any object pointer type, so instead of:

    word *new_word = (word*)malloc(sizeof(word));

    you should simply use:

    word *new_word = malloc(sizeof(word));

    or even better:

    word *new_word = malloc(sizeof *new_word);

    to avoid errors caused by changing the pointer type in the declaration but not the malloc() call.

    The sizeof operator doesn't evaluate non-variable-length array expressions, so this is a much more reliable way to obtain an object's size.

    EDIT

    As to the first character of each string missing, I would assume that is due to:

    *cur++ = '\0';
    

    as that just uselessly sets cur[0] to '\0', and then increments the pointer; the string now starts at the second letter of your word.