Search code examples
chashhashtablecs50free

Problem freeing memory in a fancy hash table


I am taking cs50 and I have this pset where I should implement a hash table to load words in a dictionary to check the spelling of words in a text, it's like a word spelling check. I had the idea to add the sum of each character's ascii number as a first layer and the first letter as a second, each sum has an array from a to z and words are stacked in this array as nodes of "the word" and the "pointer" to the next. It works fine it's a bit slow but when I had to implement the unloading section to free each chunk of memory I allocated. But valgrind says that those last nodes in the while loop aren't freed. I don't know what I do wrong.

// Implements a dictionary's functionality

#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <strings.h>

#include "dictionary.h"

// TODO: Choose number of buckets in hash table
const unsigned int SIZE_OF_ALPHABET = 26;
const unsigned int MAX_SUM = SIZE_OF_ALPHABET * LENGTH;

int number_of_words = 0;
// Represents a node in a hash table

typedef struct node
{
    char word[LENGTH + 1];
    struct node *next;
}
node;

typedef struct
{
    node *children[SIZE_OF_ALPHABET];
}
sn;

typedef struct
{
    sn *sums[MAX_SUM];
}
fn;

// Hash table
fn *root;
//node *table[N];

// Returns true if word is in dictionary, else false
bool check(const char *word)
{
    // TODO
    node *ptr;
    ptr = root->sums[hash(word)]->children[tolower(word[0]) - 'a'];
    while (ptr != NULL)
    {
        if (strcasecmp(ptr->word, word) == 0)
        {
            return true;
        }
        ptr = ptr->next;
    }
    return false;
}

// Hashes word to a number
unsigned int hash(const char *word)
{
    int sum = 0;
    // TODO: Improve this hash function
    for (int i = 0, l = strlen(word); i < l; i++)
    {
        if (isalpha(word[i]))
        {
            sum += tolower(word[i]) - 'a';
        }
    }

    return sum;
}

// Loads dictionary into memory, returning true if successful, else false
bool load(const char *dictionary)
{
    // TODO
    FILE *dict = fopen(dictionary, "r");
    if (dict == NULL)
    {
        return false;
    }

    //initialising the hash table

    //printf("%li\n", sizeof(fn));
    root = malloc(sizeof(fn));
    if (root == NULL)
    {
        return false;
    }

    for (int i = 0; i < MAX_SUM; i++)
    {
        root->sums[i] = malloc(26*(sizeof(node) - LENGTH + 1));
        if (root->sums[i] == NULL)
        {
            return false;
        }
        for (int j = 0; j < SIZE_OF_ALPHABET; j++)
        {
            root->sums[i]->children[j] = malloc(sizeof(node));
            if (root->sums[i]->children[j] == NULL)
            {
                return false;
            }
            root->sums[i]->children[j] = NULL;
        }
    }

    // iterate over every word and send it to be hashed
    int index = 0;
    int sum = 0;
    char c;
    while (fread(&c, sizeof(char), 1, dict))
    {
        char word[LENGTH + 1];

        if (c != '\n')
        {
            word[index] = c;
            if(isalpha(c))
            {
                sum += c - 'a';
            }
            index ++;
        }
        else
        {
            word[index] = '\0';

            // "Pushing" a node in the hash
            node *n = malloc(sizeof(node));
            if (n == NULL)
            {
                return false;
            }
            for (int i = 0; i < index + 1; i++)
            {
                n->word[i] = word[i];
            }
            //already did the same during lecture
            n->next = root->sums[sum]->children[word[0]-'a'];
            root->sums[sum]->children[word[0]-'a'] = n;

            number_of_words ++;
            sum = 0;
            index = 0;
            //free(n);
        }
    }
    fclose(dict);
    return true;
}

// Returns number of words in dictionary if loaded, else 0 if not yet loaded
unsigned int size(void)
{
    // TODO
    return number_of_words;
}

// Unloads dictionary from memory, returning true if successful, else false
bool unload(void)
{
    // TODO
    for (int i = 0; i < MAX_SUM; i++)
    {
        for (int j = 0; j < SIZE_OF_ALPHABET; j++)
        {
            node *ptr = root->sums[i]->children[j];
            if (ptr == NULL)
            {
                free(root->sums[i]->children[j]);
            }
            else
            {
                while (ptr->next != NULL)
                {
                    node *tmp = ptr;
                    ptr = ptr->next;
                    free(tmp);
                }
            }
        }
        free(root->sums[i]);
    }

    //free(root->sums);
    free(root);
    return true;
}

Explanation only, no code please.


Solution

  • While you are discovering some problems with this code (and posting some possible solutions in the comments above), it should be noted that the levels of indirection at play here are very different from the basic hashtable implementation described by the course materials. The question reads as an XY problem.

    For the X part of the problem:

    The goal of hashing in this exercise is to be able to produce a single integer value for every word encountered. How this integer value is derived is up to you, but should be the responsibility of the hash function. This value is used to index an array (representing a hashtable) that utilizes separate chaining for dealing with collisions.

    Looking at this implementation, it seems as though you read this note

    There are many ways to implement a hash function beyond using the first character (or characters) of a word. Consider a hash function that uses a sum of ASCII values or the length of a word. A good hash function tends to reduce “collisions” and has a fairly even distribution across hash table “buckets”.

    from pset5/Speller, and maybe took the description too literally, attempting to use two of the methods it describes (sum of characters and first character value) to produce two separate hash values. In wanting to use both these values, you have ended up with something rather over-engineered.

    A hashtable implementation closer to the intention of the problem set would be something like the following example, where the hashtable is represented as a fixed-length array of linked lists. Strings are hashed to an integer value, which locates a bucket (using the remainder operator (%) to ensure the index is valid). When a collision occurs, another node is prepended to the linked list located by the hash value.

    When it comes time to free the entire hashtable, each bucket is checked and freed like a normal linked list.

    Here is a simple example, with a basic REPL, to play with:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    
    #define HASHTABLE_SIZE 128
    
    struct node {
        struct node *next;
        char *word;
    };
    
    static struct node *table[HASHTABLE_SIZE];
    
    static size_t hash(const char *string)
    {
        /* this is djb2 (http://www.cse.yorku.ca/~oz/hash.html),
         * but you should write your own hash function, per the instructions */
        size_t h = 5381;
    
        while (*string)
            h = (h << 5) + h + *string++;
    
        return h % HASHTABLE_SIZE;
    }
    
    static int contains(const char *string)
    {
        for (struct node *n = table[hash(string)]; n; n = n->next)
            if (0 == strcmp(n->word, string))
                return 1;
    
        return 0;
    }
    
    static int insert(const char *string)
    {
        if (contains(string))
            return 0;
    
        struct node *element = calloc(1, sizeof *element);
    
        if (!element)
            return 0;
    
        element->word = strdup(string);
    
        if (!element->word) {
            free(element);
            return 0;
        }
    
        unsigned long h = hash(string);
    
        element->next = table[h];
        table[h] = element;
    
        return 1;
    }
    
    static void empty(void)
    {
        for (size_t i = 0; i < HASHTABLE_SIZE; i++) {
            struct node *element = table[i];
    
            if (element) {
                table[i] = NULL;
    
                while (element) {
                    struct node *tmp = element;
                    element = element->next;
    
                    free(tmp->word);
                    free(tmp);
                }
            }
        }
    }
    
    int main(void)
    {
        char buffer[512];
    
        while (1) {
            printf("(< = input) (> = search) (# = dump) (. = exit): ");
    
            if (!fgets(buffer, sizeof buffer, stdin) || '.' == *buffer)
                break;
    
            buffer[strcspn(buffer, "\n")] = '\0';
    
            if ('#' == *buffer) {
                for (size_t i = 0; i < HASHTABLE_SIZE; i++)
                    for (struct node *n = table[i]; n; n = n->next)
                        printf("%zu) %s\n", i, n->word);
            } else if ('<' == *buffer) {
                char *b = buffer + strspn(buffer, "< ");
                printf("<%s> %s!\n", b, insert(b) ? "added" : "not added (already exists?)");
            } else if ('>' == *buffer) {
                char *b = buffer + strspn(buffer, "> ");
                printf("<%s> is%s in the hashtable!\n", b, contains(b) ? "" : " NOT");
            }
        }
    
        empty();
    }
    

    Example usage:

    (< = input) (> = search) (# = dump) (. = exit): < foo
    <foo> added!
    (< = input) (> = search) (# = dump) (. = exit): < bar
    <bar> added!
    (< = input) (> = search) (# = dump) (. = exit): > foo
    <foo> is in the hashtable!
    (< = input) (> = search) (# = dump) (. = exit): > qux
    <qux> is NOT in the hashtable!
    (< = input) (> = search) (# = dump) (. = exit): #
    9) foo
    58) bar
    (< = input) (> = search) (# = dump) (. = exit): .
    

    For completeness (the Y part of the problem), some issues with your example are:

    In load, the memory requirement for this allocation

    root->sums[i] = malloc(26*(sizeof(node) - LENGTH + 1));
    

    is confusing. root->sums[i] is of type sn *, so each allocation should be some factor of sizeof (sn) bytes.

    As you have discovered,

    root->sums[i]->children[j] = malloc(sizeof(node));
    if (root->sums[i]->children[j] == NULL)
    {
        return false;
    }
    root->sums[i]->children[j] = NULL;
    

    is a memory leak, as the pointer to freshly allocated memory is overwritten. You "solved" this leak by removing the assignment to NULL (and also assigning values to the structure members, pressumably to avoid issues with unload and check invoking undefined behaviour when the uninitialized members were used, e.g., as arguments strcasecmp, or as control variables).

    Really, these nodes are superfluous, acting as "empty" tail nodes. The same effect could be achieved by doing no allocation, and setting every element of root->sums[i]->children to NULL.

    In load, word is local to the while loop's block.

    while (fread(&c, sizeof(char), 1, dict))
    {
        char word[LENGTH + 1];
    

    You are currently relying on word to occupy the same memory location each interation (and contain the same values as the previous iteration), which is not guaranteed to be the case. The array should be moved outside the block:

    char word[LENGTH + 1];
    while (fread(&c, sizeof(char), 1, dict))
    {
    

    In load, for both these lines,

    n->next = root->sums[sum]->children[word[0]-'a'];
    root->sums[sum]->children[word[0]-'a'] = n;
    

    word[0]-'a' is problematic if the index produced is outside the range [0, SIZE_OF_ALPHABET).

    In unload, given

    node *ptr = root->sums[i]->children[j];
    

    then

    if(ptr == NULL)
    {
        free(root->sums[i]->children[j]);
    }
    

    is the exact same as free(NULL) (a non-operation).

    Additionally, the loop condition in

    while (ptr->next != NULL)
    {
        node *tmp = ptr;
        ptr = ptr->next;
        free(tmp);
    }
    

    ensures that ptr (tmp) is only freed when it is not the last node in the linked list.

    As an aside: your hash function currently collides on anagrams. Not the end of the world, but something to note.