Imagine you have a table 'users' with 10 million records and a table 'groups' with 1 million records. On average you have 50 users per group, which, in an RDBMS at least, I would store in a table called users2groups. users2groups is in fact a sparse matrix. Only 80% of the full dataset of users and groups fits into available memory. The data for the group memberships (users2groups) comes on top, so if memory is needed to cache group memberships, it has to be taken away from the users table, the groups table, or both.
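To make the setup concrete, here is a minimal sketch of the schema I have in mind (MySQL syntax; the name columns are just placeholders):

    -- users: ~10 million rows
    CREATE TABLE users (
        id   INT UNSIGNED NOT NULL PRIMARY KEY,
        name VARCHAR(255) NOT NULL
    ) ENGINE=InnoDB;

    -- groups: ~1 million rows
    CREATE TABLE `groups` (
        id   INT UNSIGNED NOT NULL PRIMARY KEY,
        name VARCHAR(255) NOT NULL
    ) ENGINE=InnoDB;

    -- users2groups: the sparse matrix,
    -- ~50 million rows (1 million groups x 50 users per group)
    CREATE TABLE users2groups (
        user_id  INT UNSIGNED NOT NULL,
        group_id INT UNSIGNED NOT NULL,
        PRIMARY KEY (user_id, group_id),   -- "groups of a user"
        KEY idx_group (group_id, user_id)  -- "users of a group"
    ) ENGINE=InnoDB;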
I would like to be able to:
From experience I know that disk latency determines your access speed to a large extent. Also, you can balance between read and write speed. However, before one can do any of that, one has to decide on a database type these days... such as:
So the question is: which of all these systems, or which combination, gives the best overall read performance (with acceptable write performance) when RAM capacity is lower than the data volume, considering sparse matrices? ...and how does the memory utilization across all three entities/tables have to be balanced in the chosen technology...?
Since this is a conceptual question, disk space and CPU capacity are out of scope or considered to be available "indefinitely".
Btw, I am aware that searching for names, such as user names or group names, can be sped up efficiently by the use of indexes based on hashes (e.g. crc32(lower(STRING))) - an example select would then be: select somethinguseful from users where name=SEARCHSTRING and hash=crc32(lower(SEARCHSTRING)). However, the hashes and their indexes were not yet included when I said the users and groups tables have 80% RAM coverage. That's because I am unclear whether there is a better integrated solution. At the moment I just assume that breaking the whole topic up into the three pieces users, groups and users2groups is most sensible. I am lacking proof here.
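Spelled out as a sketch (the name_hash column and its index are my assumption, not part of the schema above):

    -- add a hash column plus a narrow index on it
    ALTER TABLE users
        ADD COLUMN name_hash INT UNSIGNED NOT NULL DEFAULT 0,
        ADD INDEX idx_name_hash (name_hash);

    UPDATE users SET name_hash = crc32(lower(name));

    -- lookup: the hash narrows the search,
    -- the extra name comparison catches crc32 collisions
    SELECT id
    FROM users
    WHERE name_hash = crc32(lower('SEARCHSTRING'))
      AND name = 'SEARCHSTRING';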
-------------------- UPDATE -------------------------------------
I understand that there are two competing concepts on the table: denormalization on the one hand and squeezing data into memory on the other.
As denormalization means 'blowing up data volumes', these two concepts seem to contradict each other. Are there best practices, or scientific or otherwise sensible arguments, for when to use denormalization and when to use the squeeze-data approach? E.g. a KPI saying: if less than 80% fits into memory, go for denormalization, or so?
-------------------- UPDATE -------------------------------------
Extra memory costs extra money, while most DB servers usually have lots of empty disk space getting bored on their feet. So denormalization is cheap. On the other hand, denormalization opportunities are limited: disk latency physically limits the maximum number of queries per second, full stop. So too many queries against disk get queued, which constrains denormalization to some extent for applications with lots of traffic. And even denormalized data access speed depends on memory to a large extent.
So maybe KPIs are not possible here. Anyway, for the given sparse matrix example, how do denormalization and the squeeze-data approach need to be balanced? I thought of compressing the users and groups tables, leaving them in an RDBMS, and then assigning the freed memory to the cache of a document DB which serves the users2groups relations. However, this introduces a whole set of new issues, such as multiple round trips to deal with two database systems and more complicated query management. So how to tackle this?
-------------------- UPDATE -------------------------------------
As per lagivan's suggestion, the sparse matrix case with mere flagging of relations seems to be solved in a sensible way: have two tables, users and groups, and then have in the users table a multiple-ID field with IDs referring to groups, and vice versa have in the groups table a multiple-ID field with IDs referring to users. I guess this solution is not tightly coupled to a specific technology. It could even be implemented in MySQL via VARBINARY or any blob type, roughly like the sketch below.
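A minimal sketch of that idea in MySQL, assuming 4-byte IDs packed back-to-back into a blob (packing/unpacking would happen in the application; the column names are my invention):

    -- each users row carries the IDs of its groups as a packed array:
    -- 4 bytes per group ID, concatenated, e.g. 5 groups -> 20 bytes
    ALTER TABLE users ADD COLUMN group_ids MEDIUMBLOB;

    -- and vice versa on the groups side (on average 50 users -> 200 bytes)
    ALTER TABLE `groups` ADD COLUMN user_ids MEDIUMBLOB;

    -- reading all memberships of a user is then one PK lookup
    -- plus unpacking the byte string in the application:
    SELECT group_ids FROM users WHERE id = 4711;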
The outstanding part of the question relates to sparse matrices that contain some 'juice information' like status or lastupdatedate. Using foreign-key arrays would by concept discard that information. So for such a scenario the original question is still open: which of all these systems, or which combination, gives the best overall read performance (with acceptable write performance) when RAM capacity is lower than the data volume, considering sparse matrices? ...and how does the memory utilization across all three entities/tables have to be balanced in the chosen technology...?
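For completeness, this is what the payload does to the relational shape: every pair needs its own row again, so the packed-array trick from above no longer applies (the column names status and lastupdated are just stand-ins for whatever the juice information is):

    -- users2groups as sketched above, extended by the 'juice information'
    ALTER TABLE users2groups
        ADD COLUMN status      TINYINT UNSIGNED NOT NULL DEFAULT 0,
        ADD COLUMN lastupdated TIMESTAMP NOT NULL
            DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP;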
I tried to generalize the first part of the answer: in fact I did not find a real reason to move away from an RDBMS to any other technology to solve sparse matrices better. So let's consider RDBMS only (where denormalized data can be stored in VARBINARY or blobs). I am a big fan of normalization. However, what I have learned now is: denormalize if denormalization results in lower memory consumption, considering data AND index data. Normalization rules aim at optimizing the data's memory consumption without taking into account that there are scenarios, such as sparse matrices (with indexed foreign-key pairs), that can easily reverse the benefits and costs of normalization. I also (re-)learned that squeezing data into memory as well as possible is THE key to performance (lagivan also argued for the performance leverage of caching).
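A rough back-of-envelope calculation for the example numbers illustrates the rule (the per-row overheads are my assumptions, merely in the ballpark of InnoDB; actual values vary):

    pairs            = 1,000,000 groups x 50 users/group = 50,000,000

    normalized users2groups (flags only), data plus both indexes:
      PK record        ~ 2 x 4 B keys + ~25 B overhead = ~33 B
      secondary record ~ 2 x 4 B keys + ~15 B overhead = ~23 B
      total            ~ 50,000,000 x 56 B             = ~2.8 GB

    denormalized packed ID arrays (no junction table, no extra index):
      users.group_ids  = 10,000,000 x  5 IDs x 4 B = 200 MB
      groups.user_ids  =  1,000,000 x 50 IDs x 4 B = 200 MB
      total            ~ 0.4 GB

    -> here denormalization wins by roughly a factor of 7 in RAM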
Having said that, there are various options to go for in the second part of the answer:
the ones from Lagivan's update, plus:
The solution is now to calculate the memory consumption of each acceptable option, for index AND data, and then pick the option with the lowest consumption value. Btw, there are different levels of implementation effort related to each option.
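In MySQL the actual numbers per option can be read, at least approximately, from information_schema, which makes the comparison straightforward:

    SELECT table_name,
           ROUND(data_length  / 1024 / 1024) AS data_mb,
           ROUND(index_length / 1024 / 1024) AS index_mb
    FROM   information_schema.TABLES
    WHERE  table_schema = DATABASE()
      AND  table_name IN ('users', 'groups', 'users2groups');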