I am having a problem with log searching speed and disk size. The log is extremely big: about 220 million rows and 25 GB on disk, and some selects take several minutes to complete.
How does it work? The log is saved in the database using SQL Anywhere, currently version 9, and will soon be migrated to 11 (we tried 12, but went back to 11 due to driver and other problems).
The log consists of two tables (names translated to English so people here can understand):
The first table (LogTable): Id, DateTime, User, Url, Action and TableName. Action is what the user did (insert/delete/update) and TableName is which table in the database was affected.
The second table: Id, LogTable_Id, FieldName, NewValue, OldValue. LogTable_Id is a foreign key to LogTable. FieldName is the field of the affected table in the DB.
It is important to note that NewValue and OldValue are of type varchar, because fields of every kind from the other tables (datetime, int, etc.) get recorded there.
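Roughly, the structure looks like this (names translated to English; the types and the second table's name are simplified guesses for illustration, the real definitions in SQL Anywhere 9 differ in detail):

create table LogTable (
    Id          integer primary key,
    "DateTime"  datetime,            -- when the action happened
    "User"      varchar(100),        -- who did it
    Url         varchar(200),
    Action      varchar(10),         -- insert / update / delete
    TableName   varchar(100)         -- which table was affected
);

create table LogDetail (             -- table name is illustrative; one row per changed field
    Id          integer primary key,
    LogTable_Id integer references LogTable(Id),
    FieldName   varchar(100),
    NewValue    varchar(255),        -- always varchar, whatever the original field type
    OldValue    varchar(255)
);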
Why was it made this way? Because we must record everything important. The system is built for an institutional department of traffic (I don't know if that's the proper English name, but you get the idea of what it is about), and sometimes they demand some kind of random report.
Until now, we have produced our reports simply by running some SQL selects. However, they take several minutes to complete, even when filtered by datetime. That isn't an issue worth complaining about when it's not requested often.
But they are demanding more and more reports, so it has become necessary to create a feature in the software that produces nice, polished reports. As we never know their needs in advance, we must go back to the log and unbury the data.
Some of the requested information exists only in the log (e.g. which user improperly granted someone access to a vehicle).
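For example, a request usually turns into an ad-hoc select like this (the date range and table name are made up for illustration, and LogDetail is the sketch name used above):

select l."DateTime", l."User", l.TableName, d.FieldName, d.OldValue, d.NewValue
  from LogTable l
  join LogDetail d on d.LogTable_Id = l.Id
 where l."DateTime" between '2011-01-01' and '2011-06-30'   -- made-up date filter
   and l.TableName = 'VehicleAccess'                        -- made-up table name
 order by l."DateTime";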
Some ideas suggested so far:
Idea 1: I did some research and was told to work with NoSQL, using CouchDB. But from the little I have read, I feel NoSQL isn't a solution for my problem; I can't argue why, for lack of experience with it.
Idea 2: Separate the log tables physically from the database, or even put them on a different machine.
Idea 3: Create a mirror of every table with a version field to keep history (a rough sketch of this is shown below).
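A rough sketch of idea 3, using a made-up Vehicle table purely as an example:

-- Hypothetical mirror/history table for a business table called Vehicle.
-- Every insert/update/delete would also write a row here with a new version number.
create table Vehicle_History (
    Id         integer,              -- same Id as the live Vehicle row
    Version    integer,              -- 1, 2, 3, ... per Id
    Plate      varchar(10),          -- illustrative business columns
    OwnerName  varchar(100),
    ChangedAt  datetime,
    ChangedBy  varchar(100),
    primary key (Id, Version)
);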
I'd like a macro-level optimization, or an architecture change if needed.
This seems like a pretty standard audit table. I'm not sure you need to go to a NoSQL solution for this; 220 million rows will be comfortably handled by most RDBMSs.
It seems that the biggest problem is the table structure. Generally you flatten the tables to improve logging speed and normalize them to improve reporting speed; as you can see, these goals conflict.
If you were using something like MS SQL Server, you could build a single flat table for logging performance, then build a simple Analysis Services cube on top of it.
Another option would be to just optimize for reporting, assuming you can maintain sufficient logging throughput. To do that, you may want to create a structure like this:
create table LogTable (
    LogTableID int identity(1,1) primary key,
    TableName varchar(100),
    Url varchar(200)
)

create table LogUser (
    LogUserID int identity(1,1) primary key,
    UserName varchar(100)
)

create table LogField (
    LogFieldID int identity(1,1) primary key,
    FieldName varchar(100)
)

create table LogData (
    LogDataID bigint identity(1,1) primary key,
    LogDate datetime,
    LogTableID int references LogTable(LogTableID),
    LogFieldID int references LogField(LogFieldID),
    LogUserID int references LogUser(LogUserID),
    Action char(1),              -- U = update, I = insert, D = delete
    OldValue varchar(100),
    NewValue varchar(100)
)
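With that structure, a typical report query joins back to the small lookup tables, along these lines (the user name and date range are placeholders):

select d.LogDate, u.UserName, t.TableName, f.FieldName,
       d.Action, d.OldValue, d.NewValue
  from LogData d
  join LogTable t on t.LogTableID = d.LogTableID
  join LogField f on f.LogFieldID = d.LogFieldID
  join LogUser  u on u.LogUserID  = d.LogUserID
 where u.UserName = 'some_user'                 -- placeholder
   and d.LogDate >= '2011-01-01'                -- placeholder date range
   and d.LogDate <  '2011-02-01'
 order by d.LogDate;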
This should still be fast enough to log data quickly, while providing enough performance for reporting. Index design is also important, and is generally done in order of increasing cardinality, so something like LogData(LogTableID, LogFieldID, LogDate). You can also get fancy with partitioning to allow for parallelized queries.
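For example, an index along those lines might look like this (same MS SQL-style syntax as above; adjust for your engine):

-- Columns ordered from lowest to highest cardinality, matching the filters
-- a report is most likely to use.
create index IX_LogData_Report
    on LogData (LogTableID, LogFieldID, LogDate);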