I am trying to construct an artificial intelligence unit. I plan to do this by first collecting sensory input ('observations') into a short-term working-memory list, continually forming patterns found in this list ('ideas'), and committing those ideas to a long-term storage memory when they reach a substantial size, perhaps seven chained observations. For any philosophy folks, this is similar to Locke's An Essay Concerning Human Understanding, but it won't be Tabula Rasa. There needs to be an encoded underlying structure.
Thus, my question is:
Is there (and if so, where is) a good algorithm for dynamically consolidating or 'pattern-izing' the largest repeated substrings of this constantly growing observation string? For example: if I have thus far been given ABCDABCABC, I want an ABC idea, a D, and two other ABC ideas; then, if another D is observed and added to the short-term memory, I want an ABCD token, an ABC token, and another ABCD token. I don't want to use Shortest Common Substring, because I would need to rerun it after an arbitrary number of character additions. I think I'd prefer some easily searchable/modifiable tree structure.
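To make the desired behaviour concrete, here is a brute-force Python sketch of the consolidation I am after. It just re-merges the most frequent adjacent pairs (in the spirit of byte-pair encoding) over the whole buffer on every call, which is exactly the re-running I want to avoid, so treat it as an illustration of the input/output I want rather than a candidate algorithm (the function names are made up):

```python
from collections import Counter

def merge_pass(tokens, min_count=2):
    """One pass: merge the most frequent adjacent pair of tokens,
    provided it occurs at least min_count times."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens, False
    (a, b), count = pairs.most_common(1)[0]
    if count < min_count:
        return tokens, False
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            out.append(a + b)          # the merged pair becomes a new 'idea'
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out, True

def consolidate(observations):
    """Brute force: re-tokenise the whole buffer from scratch each time."""
    tokens = list(observations)
    merged = True
    while merged:
        tokens, merged = merge_pass(tokens)
    return tokens

print(consolidate("ABCDABCABC"))   # ['ABC', 'D', 'ABC', 'ABC']
print(consolidate("ABCDABCABCD"))  # ['ABCD', 'ABC', 'ABCD']
```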
Does this look like a decent enough solution? http://www.cs.ucsb.edu/~foschini/files/licenza_spec_thesis.pdf. If nothing else, I think the other data-miners may enjoy it.
First step: the tokeniser. Define what you consider {A,B,C,D} and what not.
Building the tokeniser is an excellent way to investigate your corpus: if it is real data from the outside world, you will be amazed by the funky cases you did not even know existed when you started!
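For instance, a minimal Python sketch (the readings and thresholds are invented; yours will come from whatever your sensors actually produce):

```python
def tokenise(reading):
    """Map one raw sensory reading to a discrete symbol.
    The buckets are made up for illustration; the real work is deciding
    which distinctions in your corpus deserve their own symbol."""
    if reading < 0.25:
        return 'A'
    if reading < 0.50:
        return 'B'
    if reading < 0.75:
        return 'C'
    return 'D'

observations = ''.join(tokenise(r) for r in [0.1, 0.3, 0.6, 0.9])
print(observations)  # ABCD
```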
The second step (the recogniser) will probably be much easier, given the right tokenisation. For a normal deterministic state machine (with predefined sequences to recognise) you can use the standard algorithms from the Dragon Book, or from Crochemore.
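As a toy illustration only (not one of the textbook constructions): a trie built from the predefined sequences, walked greedily over the observation string, with unmatched symbols passed through unchanged:

```python
def build_trie(patterns):
    """Build a simple trie (nested dicts) over the predefined sequences."""
    root = {}
    for p in patterns:
        node = root
        for ch in p:
            node = node.setdefault(ch, {})
        node['$'] = p                  # '$' marks an accepting state
    return root

def recognise(stream, trie):
    """Greedy longest-match scan; unmatched symbols pass through as-is."""
    out, i = [], 0
    while i < len(stream):
        node, j, last = trie, i, None
        while j < len(stream) and stream[j] in node:
            node = node[stream[j]]
            j += 1
            if '$' in node:
                last = (node['$'], j)  # remember the longest accept so far
        if last:
            out.append(last[0])
            i = last[1]
        else:
            out.append(stream[i])
            i += 1
    return out

print(recognise("ABCDABCABCD", build_trie(["ABC", "ABCD"])))
# ['ABCD', 'ABC', 'ABCD']
```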
For fuzzy, self-learning matchers, I would start by building Markov chains or trees (maybe Bayes trees; I am not an expert on this). I don't think it will be very hard to start with a standard state machine, add some weights and counts to the nodes and edges, and dynamically add edges to the graph, or remove them. (This is where I expect it to start getting hard.)
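The counting skeleton could be as small as this Python sketch (the class name and methods are just placeholders):

```python
from collections import defaultdict

class MarkovCounter:
    """First-order Markov chain over tokens: an edge appears the first
    time a transition is seen and carries a count (weight) afterwards."""
    def __init__(self):
        self.edges = defaultdict(lambda: defaultdict(int))

    def observe(self, prev_token, next_token):
        self.edges[prev_token][next_token] += 1    # add or reinforce an edge

    def probability(self, prev_token, next_token):
        total = sum(self.edges[prev_token].values())
        return self.edges[prev_token][next_token] / total if total else 0.0

m = MarkovCounter()
stream = "ABCDABCABCD"
for a, b in zip(stream, stream[1:]):
    m.observe(a, b)
print(m.probability('C', 'D'))   # ~0.67: after C we saw D twice and A once
```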
A strategic decision: do you need a database? If your model fits in core, you won't and you should not. (Databases are not intended to fetch one row, process it, store it, then fetch the next row, etc.) If your data does not fit in core, you'll have more than a data-modelling problem. BTW: all the DNA assemblers/matchers that I know of work in core and with flat files (maybe backed up by a database for easy management and inspection).