We have a number of systems that together generate around 5M events per day. We currently keep these for around 10 days, totaling around 40-50M events. The persistence layer is an RDBMS with a web GUI slapped on top, but we are running into performance problems.
An event consists of 20-30 fields composed of the following:
Roughly 5-6 of the fields are identifiers, most of them unique, representing the event itself, the business entity/object, the context and similar. Using these identifiers we can also relate events to each other, chaining them together. The time span of an event chain may be hours, or in rare cases even days.
Currently we use the solution for analysis of individual event chains, mostly for error and outlier analysis (where did my order go?). In the future we would also like to gather statistics about events and event chains (how many orders per day? how many orders are handled by system X?). If possible the solution should also be able to grow to at least double the current size (we foresee an increase in the number of events as new systems are enabled). Analysis is currently performed by human beings, so search needs to be tolerable (searching for an event chain should take seconds, not minutes). The datastore should also allow for cleaning out stale events.
As mentioned in the beginning, we're using a standard RDBMS for this. We were using a fairly normalized structure which we've now started denormalizing to try to increase performance. I can't help wondering whether some other solution might be better, though. I've started looking around at different NoSQL databases (in my own opinion MongoDB seems promising), but I'm also trying to gather information on search engines and the like (e.g. Solr and Elasticsearch).
The question is: what type of data store/solution would be a good fit for these events? Should we head into the NoSQL space, is a search engine perhaps what we want, or are we barking up the wrong tree when what we really need is to find someone who's really good at optimizing RDBMSs?
I would suggest a hybrid solution with a conventional SQL server for the actual storage and a Lucene-based search frontend, which is populated from the SQL database by some automatic or scheduled job. The web layer queries the Lucene layer and writes to the SQL database.
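A minimal sketch of such a sync job in Java, using plain JDBC and the Lucene indexing API. The events table, its columns, the JDBC URL and the index field names are all illustrative assumptions, not a prescription:

```java
import java.nio.file.Paths;
import java.sql.*;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class EventIndexer {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://localhost/events", "user", "pass");
             IndexWriter writer = new IndexWriter(
                     FSDirectory.open(Paths.get("/data/lucene/events")),
                     new IndexWriterConfig(new StandardAnalyzer()))) {

            // Incremental pull of rows changed since the last run (hypothetical schema).
            PreparedStatement ps = conn.prepareStatement(
                    "SELECT event_id, chain_id, system_name, event_time FROM events WHERE event_time > ?");
            ps.setTimestamp(1, lastRunTimestamp());

            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    Document doc = new Document();
                    // Identifiers are stored verbatim so the web layer can jump back to SQL.
                    doc.add(new StringField("event_id", rs.getString("event_id"), Field.Store.YES));
                    doc.add(new StringField("chain_id", rs.getString("chain_id"), Field.Store.YES));
                    doc.add(new StringField("system", rs.getString("system_name"), Field.Store.YES));
                    long ts = rs.getTimestamp("event_time").getTime();
                    doc.add(new LongPoint("event_time", ts));              // for range queries
                    doc.add(new NumericDocValuesField("event_time", ts));  // for sorting by time
                    doc.add(new StoredField("event_time_stored", ts));     // for display
                    writer.addDocument(doc);
                }
            }
            writer.commit();
        }
    }

    private static Timestamp lastRunTimestamp() {
        // Placeholder: a real job would persist this watermark between runs.
        return new Timestamp(System.currentTimeMillis() - 24L * 3600 * 1000);
    }
}
```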
An SQL backend keeps your options open for the future (OLAP, etc.) and also provides a standard, scalable and multi-user way to accept data from the outside world through database connection libraries and UI tools. In short, if your data is stored in SQL you cannot go far wrong...
The Lucene layer provides extreme query performance, provided its query capabilities suffice. In a nutshell: field-value search for numbers, dates, strings, etc.; range search; multi-valued field search (a field is actually an array of values); all combinable with logical operators and boolean expressions; plus sorting and paging. HOWEVER, it cannot do grouping or aggregate functions like sum and avg.
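For illustration, a typical event-chain lookup of that kind might look like this in Java (a sketch only; the index path and field names follow the indexing sketch above and are assumptions):

```java
import java.nio.file.Paths;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;

public class EventSearch {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(
                FSDirectory.open(Paths.get("/data/lucene/events")))) {
            IndexSearcher searcher = new IndexSearcher(reader);

            long now = System.currentTimeMillis();
            // Multi-field predicate: exact chain id + originating system + time range.
            Query query = new BooleanQuery.Builder()
                    .add(new TermQuery(new Term("chain_id", "ORDER-4711")), BooleanClause.Occur.MUST)
                    .add(new TermQuery(new Term("system", "billing")), BooleanClause.Occur.MUST)
                    .add(LongPoint.newRangeQuery("event_time",
                            now - 10L * 24 * 3600 * 1000, now), BooleanClause.Occur.MUST)
                    .build();

            // Sort by the indexed timestamp and fetch the first page of 50 hits.
            Sort byTime = new Sort(new SortField("event_time", SortField.Type.LONG));
            TopDocs hits = searcher.search(query, 50, byTime);
            System.out.println("total hits: " + hits.totalHits);
        }
    }
}
```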
UPDATE: several years have passed, and Solr now has statistical capabilities such as sum, avg, etc.
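As a rough illustration, those statistics can be requested from Solr through SolrJ using the standard StatsComponent parameters (stats, stats.field); the core name "events" and the numeric field "amount" are made up for the example:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class EventStats {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/events").build()) {
            SolrQuery q = new SolrQuery("system:billing");
            q.setRows(0);                 // we only want the aggregates, not the documents
            q.set("stats", true);         // enable the StatsComponent
            q.set("stats.field", "amount");

            QueryResponse rsp = solr.query(q);
            // sum, mean, min, max etc. are returned per stats field
            System.out.println("sum: " + rsp.getFieldStatsInfo().get("amount").getSum());
        }
    }
}
```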
Query performance: in an index of 100M records, selecting a couple of hundred items with a multi-field query predicate takes under 100 ms.
Populating the index takes roughly constant time per record (it does not slow down as the index grows) because of the internal split-file (segment) implementation. It is possible to build a 5-million-document index in minutes, 20 at most, depending mainly on your storage controller. Lucene also supports real-time updates to the index, a feature we have used extensively and with success on high-load websites.
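One way to get those near-real-time updates is Lucene's SearcherManager opened on top of the IndexWriter, so new documents become searchable after a cheap refresh rather than a full commit and reopen. A small sketch, with illustrative paths and field names:

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.FSDirectory;

public class NearRealTimeExample {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/data/lucene/events")),
                new IndexWriterConfig(new StandardAnalyzer()));
        // A SearcherManager opened on the writer can see documents that are not yet committed.
        SearcherManager manager = new SearcherManager(writer, null);

        Document doc = new Document();
        doc.add(new StringField("event_id", "EVT-1", Field.Store.YES));
        writer.addDocument(doc);

        manager.maybeRefresh();                  // make the new document visible to searchers
        IndexSearcher searcher = manager.acquire();
        try {
            int hits = searcher.count(new TermQuery(new Term("event_id", "EVT-1")));
            System.out.println("visible hits: " + hits);
        } finally {
            manager.release(searcher);
        }

        manager.close();
        writer.close();
    }
}
```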
Lucene supports splitting an index into sub-indexes and index hierarchies, so you can create an index per day but search across all of them (or a specific subset of them) with a single query, using a multi-index reader. I tried it with 2000 separate index files and the performance was amazing.
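A sketch of that per-day layout, assuming one index directory per day (the directory names are illustrative); Lucene's MultiReader presents the chosen sub-indexes as one logical index to a single search:

```java
import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class MultiIndexSearch {
    public static void main(String[] args) throws Exception {
        // One index directory per day; pick whichever subset the query needs.
        String[] days = {"day-1", "day-2", "day-3"};
        DirectoryReader[] readers = new DirectoryReader[days.length];
        for (int i = 0; i < days.length; i++) {
            readers[i] = DirectoryReader.open(
                    FSDirectory.open(Paths.get("/data/lucene/events-" + days[i])));
        }

        // MultiReader aggregates the per-day indexes into one logical index.
        try (MultiReader multi = new MultiReader(readers)) {
            IndexSearcher searcher = new IndexSearcher(multi);
            TopDocs hits = searcher.search(new TermQuery(new Term("chain_id", "ORDER-4711")), 100);
            System.out.println("hits across all days: " + hits.totalHits);
        }
    }
}
```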
This architecture can be implemented without much effort in Java or .NET; both have great SQL and Lucene support.