Search code examples
cstring-matchingtrie

How to make trie store reincidence of a word in C


the current node of my trie stores a char and a position (which is the position in the text where the current word being read ends).

if I read a word "foo" in position 100 and another word "foo" in position 200, how can my node store this 2 ocurrences? Is there a fast way (with arrays or something even faster to implement) or I'll need to implement linked lists?


Solution

  • So, your trie node looks something like

    struct trie_node {
        struct trie_node *next;
        strict trie_node *child;
        wchar_t           character;
        off_t             position;
    };
    

    You might use size_t position; instead if the data is always in memory, of course.

    If we assume many prefixes do not map to a specific position (because they are not complete words), it might be useful to use a separate array for the positions, ie.

    struct positions {
        size_t            count_max;
        size_t            count;
        off_t             position[];
    };
    
    struct trie_node {
        struct trie_node *next;
        struct trie_node *child;
        wchar_t           character;
        struct positions *position;
    };
    

    Character nodes that do not correspond to complete words, can have a NULL position member. The count_max corresponds to the number of positions allocated for, and count corresponds to the current number of positions. The array can be reallocated and resized when necessary. This kind of array resizing is common in real-world applications; the overhead (of reallocation) is considered quite acceptable, especially compared to the alternatives.


    Another interesting option would be to use a linear array to represent the words in the text in the order as they appear, with the position member in the trie node specifying the index of the first occurrence in the array. Each array entry would contain the index of the next occurrence, plus optionally a link back to the trie node:

    #include <stdlib.h>
    #include <limits.h>
    
    struct trie_node {
        struct trie_node  *next;
        struct trie_node  *child;
        wchar_t            character;
        size_t             index;     /* NO_INDEX if no occurrences */
        size_t             occurs;    /* Num of occurrences, optional */
        wchar_t            word[];    /* Optional, entire word */
    };
    
    /* When 'index' refers to 'none', use: */
    #define  NO_INDEX  SIZE_MAX
    
    struct occurrence {
        off_t              offset;
        size_t             next;
        struct trie_node  *node;    /* Optional */
    };
    

    A container structure would have the array, and the trie would hang off it:

    struct text {
        size_t             count_max;
        size_t             count;
        struct occurrence *occurrences;
        struct trie_node  *trie;
    };
    

    Your functions would then take a pointer to a struct text.

    The occurrences array in the struct text can be dynamically reallocated as needed. (This is also why the first member in the trie node is an index to the array, and not a pointer: if it was a pointer, we might have to go through the entire trie to update all nodes' pointers, when reallocating the array, otherwise.)

    Note that because we use a size_t as an index to the array, NO_INDEX is the largest possible value, and size_t is an unsigned integer type, it is sufficient to check if (i < count) to verify that index i is valid.

    Each trie node that corresponds to a full word has index != NO_INDEX, and the C99 flexible array member word initialized to the full word (including the trailing L'\0'). The occurs member would have the number of occurrences of the word, if it is useful. (There is no need for it, other than that we humans might be interested in the number of occurrences of each word.)

    This scheme allows a direct access to the sequence of words in the text.

    If the occurrences are in increasing offset in the array, one can use a binary search to find the word that is between specific offsets. Because each occurrence has a back link to the trie node, which contains the full word in the word member, it is easy to print out any word that occurs in the file, without having to scan through the entire trie.

    I wrote this answer, because I wanted to show how combining two very different data structures in this manner, can open up very efficient methods of accessing the data. I cannot say whether it is useful, because usefulness depends on the problem that is being solved.