Search code examples
chashtableassociative-array

C: sum of integer values by string identifiers


So, I have two files of financial data, say 'symbols', and 'volumes'. In symbols I have strings such as:

FOO
BAR
BAZINGA
...

In volumes, I have integer values such as:

0001387
0000022
0123374
...

The idea is that the stock symbols will repeat in the file and I need to find the total volume of each stock. So, each row where I observe foo I increment total volume of foo by the value observed in volumes. The problem is that these files can be huge: easily 5 - 100 million records. A typical day may have ~1K different symbols in the file.

Doing it using strcmp on symbols each new line will be very inefficient. I was thinking of using an associative array --- hash table library which allows string keys --- such as uthash or Glib's hashtable.

I am reading some pretty good things about Judy arrays? Is the licensing a problem in this case?

Any thoughts on the choice of an efficient hash-table implementation? And also, whether I should use hash tables at all or perhaps something else entirely.

Umm.. apologize for the omission earlier: I need to have a pure C solution.

Thanks.


Solution

  • My solution:

    I did end up using the JudySL array to solve this problem. After some reading, the solution was quite simple to implement using Judy. I am replicating the solution here in full for it to be useful to anyone else.

    #include <stdio.h>
    #include <Judy.h>
    
    const unsigned int BUFSIZE = 10; /* A symbol is only 8 chars wide. */
    
    int main (int argc, char const **argv) {
    
      FILE *fsymb = fopen(argv[1], "r");
      if (fsymb == NULL) return 1;
    
      FILE *fvol = fopen(argv[2], "r");
      if (fvol == NULL) return 1;
    
      FILE *fout = fopen(argv[3], "w");
      if (fout == NULL) return 1;
    
      unsigned int lnumber = 0;
      uint8_t symbol[BUFSIZE];
      unsigned long volume;
    
      /* Initialize the associative map as a JudySL array. */
      Pvoid_t assmap = (Pvoid_t) NULL;
      Word_t *value;
    
      while (1) {
    
        fscanf(fsymb, "%s", symbol);
        if (feof(fsymb)) break;
    
        fscanf(fvol, "%lu", &volume);
        if (feof(fvol)) break;
    
        ++lnumber;
    
        /* Insert a new symbol or return value if exists. */
        JSLI(value, assmap, symbol);
        if (value == PJERR) {
            fclose(fsymb);
            fclose(fvol);
            fclose(fout);
            return 2;
        }
        *value += volume;
    
      }
    
      symbol[0] = '\0'; /* Start from the empty string. */
      JSLF(value, assmap, symbol); /* Find the next string in the array. */
      while (value != NULL) {
        fprintf(fout, "%s: %lu\n", symbol, *value); /* Print to output file. */
        JSLN(value, assmap, symbol); /* Get next string. */
      }
    
      Word_t tmp;
      JSLFA(tmp, assmap); /* Free the entire array. */
    
      fclose(fsymb);
      fclose(fvol);
      fclose(fout);
      return 0;
    
    }
    

    I tested the solution on a 'small' sample containing 300K lines. The output is correct and the elapsed time was 0.074 seconds.