performance search indexing clustered-index

Fast regex search

How can I index 50-100 GB of text lines so that regex searches run fast, or at least faster than scanning line by line? The regex pattern changes from search to search, so I can't take it into account when building the index.

Is it possible to achieve something like this with Lucene? I know it might be possible with suffix trees, but that index takes too much memory (much more than the 100 GB of data itself).

Solution

The main thing you have to do is identify the common search terms in advance, and then build the index around those terms.

For instance, maybe you anticipate that there will be a lot of searches for lines starting with "Foo". Then you can run that search in advance and store a list of the lines starting with "Foo". Then, if someone searches for lines starting with "Foobar", you've already got a narrowed-down subset of lines to search, because every line that starts with "Foobar" also starts with "Foo".
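The "Foo" idea might be sketched roughly like this in Python. This is only an illustration under some assumptions: the lines fit in a list (for 50-100 GB you would store file offsets on disk instead), and the caller passes the literal prefix of the regex by hand, since extracting it from the pattern automatically is a separate problem. The names `build_prefix_index` and `search_with_prefix` are made up for this example.

```python
import re

def build_prefix_index(lines, prefixes):
    """Map each anticipated prefix to the numbers of the lines that start with it."""
    index = {prefix: [] for prefix in prefixes}
    for lineno, line in enumerate(lines):
        for prefix in prefixes:
            if line.startswith(prefix):
                index[prefix].append(lineno)
    return index

def search_with_prefix(lines, index, literal_prefix, pattern):
    """Run the regex only over lines that share an indexed prefix with the query.

    literal_prefix is the fixed text every match must start with
    (hypothetical parameter: supplied by the caller, not derived from pattern).
    """
    # Any indexed prefix that is itself a prefix of the query prefix is safe to use;
    # the longest one gives the smallest candidate set.
    usable = [p for p in index if literal_prefix.startswith(p)]
    if usable:
        candidates = index[max(usable, key=len)]
    else:
        # No indexed prefix applies, so fall back to a full scan.
        candidates = range(len(lines))
    regex = re.compile(pattern)
    return [n for n in candidates if regex.search(lines[n])]
```

With an index built for "Foo", a search for `^Foobar` only runs the regex over the lines already known to start with "Foo"; a search with no matching indexed prefix still works, it just scans everything.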

If you want to get really clever, you can programmatically analyze common searches to find recurring search components, and then index based on those common components.
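One rough way to do that analysis, assuming you keep a log of past regex patterns, is to pull the literal-looking pieces out of each pattern and count how many patterns each piece appears in. The sketch below is a crude heuristic, not a regex parser: it ignores alternation and groups, so it is only good for suggesting what to index, not for proving that a literal must appear in every match. The function names and the `min_len` cutoff are assumptions made for the example.

```python
import re
from collections import Counter

def literal_components(pattern, min_len=3):
    """Heuristically extract literal alphanumeric runs from a regex pattern."""
    # Blank out escapes such as \d or \b so their letters are not taken as literals.
    stripped = re.sub(r"\\.", " ", pattern)
    # Blank out character classes such as [abc], which are not literal text.
    stripped = re.sub(r"\[[^\]]*\]", " ", stripped)
    components = []
    for run, quantifier in re.findall(r"([A-Za-z0-9]+)([?*{]?)", stripped):
        # A trailing ?, * or {..} makes the last character optional or repeated,
        # so it is dropped from the literal.
        if quantifier:
            run = run[:-1]
        if len(run) >= min_len:
            components.append(run)
    return components

def common_components(query_log, top_n=10):
    """Count in how many logged patterns each literal component occurs."""
    counts = Counter()
    for pattern in query_log:
        counts.update(set(literal_components(pattern)))
    return counts.most_common(top_n)
```

The components that come out on top are the candidates to index in advance, in the same way as the "Foo" example.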