the current node of my trie stores a char and a position (which is the position in the text where the current word being read ends).
if I read a word "foo" in position 100 and another word "foo" in position 200, how can my node store this 2 ocurrences? Is there a fast way (with arrays or something even faster to implement) or I'll need to implement linked lists?
So, your trie node looks something like
struct trie_node {
struct trie_node *next;
strict trie_node *child;
wchar_t character;
off_t position;
};
You might use size_t position;
instead if the data is always in memory, of course.
If we assume many prefixes do not map to a specific position (because they are not complete words), it might be useful to use a separate array for the positions, ie.
struct positions {
size_t count_max;
size_t count;
off_t position[];
};
struct trie_node {
struct trie_node *next;
struct trie_node *child;
wchar_t character;
struct positions *position;
};
Character nodes that do not correspond to complete words, can have a NULL position
member. The count_max
corresponds to the number of positions allocated for, and count
corresponds to the current number of positions. The array can be reallocated and resized when necessary. This kind of array resizing is common in real-world applications; the overhead (of reallocation) is considered quite acceptable, especially compared to the alternatives.
Another interesting option would be to use a linear array to represent the words in the text in the order as they appear, with the position
member in the trie node specifying the index of the first occurrence in the array. Each array entry would contain the index of the next occurrence, plus optionally a link back to the trie node:
#include <stdlib.h>
#include <limits.h>
struct trie_node {
struct trie_node *next;
struct trie_node *child;
wchar_t character;
size_t index; /* NO_INDEX if no occurrences */
size_t occurs; /* Num of occurrences, optional */
wchar_t word[]; /* Optional, entire word */
};
/* When 'index' refers to 'none', use: */
#define NO_INDEX SIZE_MAX
struct occurrence {
off_t offset;
size_t next;
struct trie_node *node; /* Optional */
};
A container structure would have the array, and the trie would hang off it:
struct text {
size_t count_max;
size_t count;
struct occurrence *occurrences;
struct trie_node *trie;
};
Your functions would then take a pointer to a struct text
.
The occurrences
array in the struct text
can be dynamically reallocated as needed. (This is also why the first
member in the trie node is an index to the array, and not a pointer: if it was a pointer, we might have to go through the entire trie to update all nodes' pointers, when reallocating the array, otherwise.)
Note that because we use a size_t
as an index to the array, NO_INDEX
is the largest possible value, and size_t
is an unsigned integer type, it is sufficient to check if (i < count)
to verify that index i
is valid.
Each trie node that corresponds to a full word has index != NO_INDEX
, and the C99 flexible array member word
initialized to the full word (including the trailing L'\0'
). The occurs
member would have the number of occurrences of the word, if it is useful. (There is no need for it, other than that we humans might be interested in the number of occurrences of each word.)
This scheme allows a direct access to the sequence of words in the text.
If the occurrences are in increasing offset in the array, one can use a binary search to find the word that is between specific offsets. Because each occurrence has a back link to the trie node, which contains the full word in the word
member, it is easy to print out any word that occurs in the file, without having to scan through the entire trie.
I wrote this answer, because I wanted to show how combining two very different data structures in this manner, can open up very efficient methods of accessing the data. I cannot say whether it is useful, because usefulness depends on the problem that is being solved.