I have data that needs to stay in the exact sequence it was entered in (genome sequencing), and I want to search approximately one billion nodes of around 18 members each to locate patterns.
Obviously speed is an issue with a data set this large, and I don't currently have anything I can use as a discrete key, since the whole point of the search is to locate and isolate (but not remove) duplicates.
I'm looking for an algorithm that can go through the data in a relatively short amount of time to locate these patterns and similarities. I can work out the regular expressions for the comparisons themselves, but I'm not sure how to get the search below O(n).
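To make the problem concrete, here's a minimal sketch of the brute-force pass I'm picturing, in Python. The records, the pattern, and the exact-duplicate check are all just placeholders for illustration, not my real data or regexes:

```python
import re

# Hypothetical example: each record is an 18-character string over A/C/G/T,
# kept in its original input order (the list index preserves the sequence).
records = ["ACGTACGTACGTACGTAC", "TTGACCATTGACCATTGA", "ACGTACGTACGTACGTAC"]

# Placeholder pattern -- in practice this would be one of the regexes
# I'd build for the comparison step.
pattern = re.compile(r"ACGT(ACGT)+")

# The straightforward O(n) pass: check every record against the pattern
# and remember where the matches occur, without removing anything from
# the original sequence.
match_positions = [i for i, rec in enumerate(records) if pattern.search(rec)]

# Separately, flag exact duplicates by remembering the first position
# at which each record was seen.
seen = {}
duplicate_positions = []
for i, rec in enumerate(records):
    if rec in seen:
        duplicate_positions.append((seen[rec], i))  # (first occurrence, duplicate)
    else:
        seen[rec] = i

print(match_positions)      # e.g. [0, 2]
print(duplicate_positions)  # e.g. [(0, 2)]
```

This is the linear scan I'd like to improve on, given that it has to run over roughly a billion records.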
Any help would be appreciated.
Thanks