Search code examples
calgorithmhashtable

Evaluating Improvements in C Program Using Hash Tables: A Beginner's Perspective


I'm new to C programming and I'm trying to understand the differences between two programs that check for duplicate elements in an array using hash tables with the uthash library. I've noticed some differences in their implementation, and I'm not sure if the changes in Program 1 are indeed improvements over Program 2. Could you help me understand these differences and their implications?

Program 1

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include "uthash.h"

typedef struct {
    int key;
    UT_hash_handle hh;
} hash_table;

bool containsDuplicate(int* nums, int numsSize) {
    if (numsSize == 1) {
        return false;
    }

    hash_table *hash = NULL;
    hash_table *elem = NULL;
    bool flag = false;

    for (int i = 0; i < numsSize; i++) {
        HASH_FIND_INT(hash, &nums[i], elem);

        if (!elem) {
            elem = (hash_table *)malloc(sizeof(hash_table));
            elem->key = nums[i];
            HASH_ADD_INT(hash, key, elem);
        } else {
            flag = true;
            break;
        }
    }

    hash_table *tmp;
    HASH_ITER(hh, hash, elem, tmp) {
        HASH_DEL(hash, elem);
        free(elem);
    }

    return flag;
}

Program2

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include "uthash.h"

typedef struct {
    int key;
    UT_hash_handle hh;
} hash_table;

hash_table *hash = NULL, *elem, *tmp;

bool containsDuplicate(int* nums, int numsSize){
    if (numsSize == 1) {
        return false;
    }

    bool flag = false;

    for (int i=0; i<numsSize; i++) {
        HASH_FIND_INT(hash, &nums[i], elem);

        if(!elem) {
            elem = malloc(sizeof(hash_table));
            elem->key = nums[i];
            HASH_ADD_INT(hash, key, elem);
        } else {
            flag = true;
            break;
        }
    }

    HASH_ITER(hh, hash, elem, tmp) {
        HASH_DEL(hash, elem); free(elem);
    }

    return flag;
}

From my beginner's perspective, the main differences I see are:

  • Program 1 declares the hash table hash locally within the containsDuplicate function, while Program 2 declares it globally.
  • Program 1 explicitly frees the hash table at the end of the function, which is also done in Program 2 but with a global hash table.

I'm trying to understand:

  • Are the changes in Program 1 really improvements over Program 2?
  • What are the implications of using a local vs. global hash table in terms of memory management and function behavior?
  • Are there any potential issues with using a global hash table across multiple function calls?
  • Which approach is generally considered better practice for a beginner like me, and why?

Solution

  • The only significant difference is that your 1st program uses local variables and your 2nd program uses global variables for your hash tables. Global variables make your program harder to reason about as you have to audit the whole program to figure out what reads or writes to any of those 3 global variables. Global variables make your code harder to test.

    If you have other uses for the hash table then it would make sense to keep it around rather than making it an implementation detail of containsDuplicate(). It costs O(n) to build the hash table but only O(1) to check if a new value is already present. Neither program supports this currently.

    Don't cast the void * from malloc(). It may hide type issues.

    Prefer sizeof *elem to sizeof(hash_table). The former is purely mechanical and it leads to less duplication.