I am trying to construct an artificial intelligence unit. I plan to do this by first collecting sensory input ('observations') into a short-term working-memory list, continually forming patterns found in this list ('ideas'), and committing those ideas to a long-term storage memory when they reach a substantial size, perhaps seven chained observations. For any philosophy folks, this is similar to Locke's An Essay Concerning Human Understanding, but it won't be Tabula Rasa. There needs to be an encoded underlying structure.
Thus, my question is:
Is there (and if so, where is) a good algorithm for dynamically consolidating or 'pattern-izing' the largest repeated substrings of this constantly growing observation string? For example: if I have thus far been given ABCDABCABC, I want an ABC idea, a D, and two other ABC ideas; then, if another D is observed and added to the short-term memory, I want an ABCD token, an ABC token, and another ABCD token. I don't want to use Shortest Common Substring, because I would need to rerun it after an arbitrary number of character additions. I think I'd prefer some easily searchable/modifiable tree structure.
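To make the desired behaviour concrete, here is a brute-force Python sketch of the consolidation I am after. It just re-merges the most frequent adjacent pairs (in the spirit of byte-pair encoding) over the whole buffer on every call, which is exactly the re-running I want to avoid, so treat it as an illustration of the input/output I want rather than a candidate algorithm (the function names are made up):

```python
from collections import Counter

def merge_pass(tokens, min_count=2):
    """One pass: merge the most frequent adjacent pair of tokens,
    provided it occurs at least min_count times."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens, False
    (a, b), count = pairs.most_common(1)[0]
    if count < min_count:
        return tokens, False
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            out.append(a + b)          # the merged pair becomes a new 'idea'
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out, True

def consolidate(observations):
    """Brute force: re-tokenise the whole buffer from scratch each time."""
    tokens = list(observations)
    merged = True
    while merged:
        tokens, merged = merge_pass(tokens)
    return tokens

print(consolidate("ABCDABCABC"))   # ['ABC', 'D', 'ABC', 'ABC']
print(consolidate("ABCDABCABCD"))  # ['ABCD', 'ABC', 'ABCD']
```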
Does this look like a decent enough solution? http://www.cs.ucsb.edu/~foschini/files/licenza_spec_thesis.pdf. If nothing else, I think the other data-miners may enjoy it.
First step: the tokeniser. Define what you consider {A,B,C,D} and what not.
Building the tokeniser is an excellent way to investigate your corpus: if it is real data from the outside world, you will be amazed by the funky cases you did not even know existed when you started!
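For instance, a minimal Python sketch (the readings and thresholds are invented; yours will come from whatever your sensors actually produce):

```python
def tokenise(reading):
    """Map one raw sensory reading to a discrete symbol.
    The buckets are made up for illustration; the real work is deciding
    which distinctions in your corpus deserve their own symbol."""
    if reading < 0.25:
        return 'A'
    if reading < 0.50:
        return 'B'
    if reading < 0.75:
        return 'C'
    return 'D'

observations = ''.join(tokenise(r) for r in [0.1, 0.3, 0.6, 0.9])
print(observations)  # ABCD
```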
The second step (the recogniser) will probably be much easier, given the right tokenisation. For a normal deterministic state machine (with predefined sequences to recognise) you can use the standard algorithms from the Dragon Book, or from Crochemore.
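As a toy illustration only (not one of the textbook constructions): a trie built from the predefined sequences, walked greedily over the observation string, with unmatched symbols passed through unchanged:

```python
def build_trie(patterns):
    """Build a simple trie (nested dicts) over the predefined sequences."""
    root = {}
    for p in patterns:
        node = root
        for ch in p:
            node = node.setdefault(ch, {})
        node['$'] = p                  # '$' marks an accepting state
    return root

def recognise(stream, trie):
    """Greedy longest-match scan; unmatched symbols pass through as-is."""
    out, i = [], 0
    while i < len(stream):
        node, j, last = trie, i, None
        while j < len(stream) and stream[j] in node:
            node = node[stream[j]]
            j += 1
            if '$' in node:
                last = (node['$'], j)  # remember the longest accept so far
        if last:
            out.append(last[0])
            i = last[1]
        else:
            out.append(stream[i])
            i += 1
    return out

print(recognise("ABCDABCABCD", build_trie(["ABC", "ABCD"])))
# ['ABCD', 'ABC', 'ABCD']
```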
For fuzzy, self-learning matchers, I would start by building Markov chains or trees (maybe Bayes trees; I am not an expert on this). I don't think it will be very hard to start with a standard state machine, add some weights and counts to the nodes and edges, and dynamically add edges to the graph, or remove them. (This is where I expect it to start getting hard.)
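The counting skeleton could be as small as this Python sketch (the class name and methods are just placeholders):

```python
from collections import defaultdict

class MarkovCounter:
    """First-order Markov chain over tokens: an edge appears the first
    time a transition is seen and carries a count (weight) afterwards."""
    def __init__(self):
        self.edges = defaultdict(lambda: defaultdict(int))

    def observe(self, prev_token, next_token):
        self.edges[prev_token][next_token] += 1    # add or reinforce an edge

    def probability(self, prev_token, next_token):
        total = sum(self.edges[prev_token].values())
        return self.edges[prev_token][next_token] / total if total else 0.0

m = MarkovCounter()
stream = "ABCDABCABCD"
for a, b in zip(stream, stream[1:]):
    m.observe(a, b)
print(m.probability('C', 'D'))   # ~0.67: after C we saw D twice and A once
```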
A strategic decision: do you need a database? If your model fits in core, you won't and you should not. (Databases are not intended to fetch one row, process it, store it, then fetch the next row, etc.) If your data does not fit in core, you'll have more than a data-modelling problem. BTW: all the DNA assemblers/matchers that I know of work in core and with flat files (maybe backed up by a database for easy management and inspection).