c++performance finite-automata state-machine memory-efficient

Space and Time Efficient Encoding of Transitions for Finite Automata

I have been thinking about how to space efficiently encode transitions for finite automata and also guarantee a quick lookup time. An idea that sounded good to me was to use for example ints if I know I have at most 32 outgoing transitions per state to space efficiently encode transition symbols as 1s or 0s in the bits of an int.

So, I have a class that maps a type T (for example strings) to ints. ID(string) returns the ID that the string as been assigned as its integer-encoding. Adding the strings "fish", "cat" and "tree" one after the other into an empty ID object would assign 0 to "fish", 1 to "cat" and 2 to "tree".

Later I would associate powers of 2 with individual transition symbols. The power is determined by the ID assigned to the transition symbol.

If the ID class had been fed the English alphabet rather than "fish", "cat" and "tree" the resulting mapping would be

a : 0
b : 1
c : 2
...
j : 9
...
z : 26

The outgoing_symbols field of a state having the outgoing edges "a", "c", "e" and "f" would therefore look like this:

00000000 00000000 00000000 00110101
      zy xwvutsrq ponmlkji hgfedcba

Now I can simply do state.outgoing_symbols+=pow(2,ID(transition_symbol)) when adding a transition to an existing state.

I would do state.outgoing_symbols+=pow(2,ID(j)) to add 2^9 to outgoing_symbols resulting in

00000000 00000000 00000010 00110101
      zy xwvutsrq ponmlkji hgfedcba

The advantage of this representation is that I can store 32 symbols in a single int and I can do constant time querying of states whether they have a transition with a given symbol:

Assume delta is a vector of structs like this

┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
│ 0 │ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │ 7 │ 8 │ 9 │...│   │   │   │   │   │   │   │   │   │   │ n │
├───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┤
└───┴───┴───┴───┴───┴───┴───┴─┬─┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
                              │ 
                              │
                              │
                              │
                              │
                              ├───────────────────────────────────────────────────────────────────────┐
                              │   sym_enc outgoing_symbols     00000000 00001000 10000010 00110100    │
                              │                                                                       │
                              │   T mapping_of_symbols_onto_target_states                             |
                              └───────────────────────────────────────────────────────────────────────┘

which maps state IDs 0 through n to structs of outgoing_symbols and a mapping from symbols to target states. Then I can write this function:

bool has_outgoing_symbol(int state, int symbolID)
{
    return delta[state].outgoing_symbols & pow(2, symbolID) == symbolID;
}

The big issue is that I so far have not associated transition symbols with target states and I cannot think of any way to use this very efficient encoding of transitions together with an efficient encoding of the necessary mapping.

I could have 2 vectors, one with the transition symbol IDs and one vector of vectors which store the IDs of the target states. The two vectors would be "synced up", so that for all i vector1[i] corresponds to vectpr2[i]. The reason to have 2 vectors rather than 1 vector of structs of the form

struct transition
{
    std::vector to_states;
    int transition_symbol;
};

is to utilize some processor magic that I don't understand where apparently it is better of have several vectors of simple types rather than one vector of structs of simple types.

In any case having to look up the target state in a linear or logarithmic lookup kind of makes the advantage of constant lookup for the transition symbol via encoding as powers of 2 go to waste. But I can't think of any way to make use of that encoding for the purposes of mapping the symbols onto target states.

Can anybody give me suggestion on either where to read up something like this or maybe even straight up have an idea how to do this?

Solution

If I understand you correctly, you want to store an entry in a vector for each symbol that has a bit set in the bitmask, and efficiently look up the entry given a symbol.

In that case, you can compute the index of the entry by counting the number of bits set in the mask for all symbols lower than the one you are checking:

int getSymbolIndex(int state, int symbolID)
{
      if(symbolID == 0)
        return 0;
    return NumberOfSetBits(delta[state].outgoing_symbols & ((1 << symbolID) -1));
}

Use the index returned to look up the entry in a vector of target states stored for the state. It only gives valid results for symbols that are actually in the set.

For an efficient way to count bits in an integer, see this question: How to count the number of set bits in a 32-bit integer?