The Use case:
For a game I am collecting the results of each game match. It's always Team A against Team B. Each team consists of 5 players each picking a champion and the possible outcome of a match is for one team either Won / Lost or for both teams a draw.
I would like to figure out the best champion combinationsI want to create win/lose/draw statistics based on the chosen champion combination of each team. In total there are ~100 champions a player can chose from. So there are many different champion combinations possible.
More (bonus) features:
My problem:
I wonder what's the best way to collect the statistics based on the champion combination? How would the data modelling look like?
My idea:
Create a hash of all championId
in a combination which would literally represent a championCombinationId
which is a unique identifier for the champion combo a team uses.
Create a two dimensional table which allows tracking combination vs combination stats. Something like this:
Timeframes (daily dates) and the actual championId
s for a combinationId
are missing there.
I tried myself creating a model for the above requirements, but I am absolutely not sure about it. Nor do I know what keys I would need to specify.
CREATE TABLE team_combination_statistics (
combinationIdA text, // Team A
combinationIdB text, // Team B
championIdsA text, // An array of all champion IDs of combination A
championIdsB text, // An array of all champion IDs of combination B
trackingTimeFrame text, // A date?
wins int,
losses int,
draws int
);
This question is quite long so I'll talk about different topics before suggesting my approach, be ready for a long answer:
Storing total ammount of data is useful but ordering by it isn't, as the order doesn't determine if a combination is good vs another, it determines the combination that most times have won/lost vs the opposite but the total ammount of games played also matters.
When ordering the results, you want to order by win-ratio, draw-ratio, loose-ratio of two of the previous as the third is a linear combination.
The problem on two-dimensional tables where both dimensions represent the same data, in this case a group of 5 champs, is that either you make a triangular table or you have data doubled as you will have to store cominationA vs combinationB and combinationB vs combinationA, being combinationX a specific group of 5 champs.
There are two aproaches here, using triangular tables or doubling the data manually:
You create a table where either the top right half is empty or the bottom left hand is empty. You then handle in the app which hash is A and which is B, and you may need to swap their order, as there is no duplicate data. You could for example consider alphabetical order where A < B always. If you then request the data in the wrong order you would get no data. The other option would be making both A vs B and B vs A query and then joining the results (swapping the wins and looses obviously).
By making two inserts with reflected values (A, B, wins, draws, looses & B, A, looses, draws, wins) you would duplicate the data. This lets you query in any order at the cost of using two times the space and requiring double inserts.
The pros of one approach are the cons of the other.
I would probably use the triangular tables approach as the application complexity increase is not that big to be relevant, but the scalability does matter.
Use whatever keyspace you want, I choose so from stackoverflow. Modify the replication strategy or factor as needed.
CREATE KEYSPACE so WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
The champion table will contain info about the different champions, for now it will only hold the name but you could store other things in the future.
CREATE TABLE so.champions (
c boolean,
id smallint,
name text,
PRIMARY KEY(c, id)
) WITH comment='Champion names';
A boolean
is used as the partition key as we want to store all champs in a single partition for query performance and there will be a low ammount of records (~100) we will always be using c=True
. A smallint
was choosen for the id
as 2^7 = 128 was to close to the actual number of champs and to leave room for future champs without using the negative numbers.
When querying the champs you could get them all by doing:
SELECT id, name FROM so.champions WHERE c=True;
or request a specific one by:
SELECT name FROM so.champions WHERE c=True and id=XX;
This table will store the results of the matches without agregating:
CREATE TABLE so.matches (
dt date,
ts time,
id XXXXXXXX,
teams list<frozen<set<smallint>>>,
winA boolean,
winB boolean,
PRIMARY KEY(dt, ts, id)
) WITH comment='Match results';
For the partition of an historic data table, and as you mentioned daily precission, date
seems to be a nice partition key. A time
column is used as the first clustering key for ordering reasons and to complete the timestamp, doesn't matter if these timestamp belong to the ending or finishing instant, choose one and stick with it. An additional identifier is required in the clustering key as 2 games may end in the same instant (time has nanosecond precission which would basically mean that the data lost to overlap would be quite insignificant but your data source will probably not have this precission, thus making this last key column necesary). You can use whatever type you want for this column, probably you will already have some king of identifier with the data that you can use here. You could also go for a random number, an incremental int managed by the application, or even the name of the first players as you can be sure the same player will not start/finish two games at the same second.
The teams
column is the most important one: it stores the ids of the champs that were played in the game. A sequence of two elements is used, one for each team. The inner (frozen) set is for the champs id in each team, for example: {1,3,5,7,9}
. I've tried a couple different options: set< frozen<set<smallint>> >
, tuple< set<smallint>> , set<smallint> >
and list< frozen<set<smallint>> >
. The first options doesn't store the order of the teams, so we would have no way to know who win the game. The second one doesn't accept using an index on this column and doing partial searchs through CONTAINS
so I've opted for the third that does keep the order and allows partial searchs.
The other two values are two booleans representing who won the game. You could have additional columns such a draw boolean
one but this one is not necesary or duration time
if you want to store the length of the game (I'm not using Cassandra's duration
type on purpouse as it is only worth when it takes months or at least days), end timestamp
/start timestamp
if you want to store the one that you are not using in the partition and clustering key, etc.
It may be useful to create an index on teams so that you are allowed to query on this column:
CREATE INDEX matchesByTeams ON so.matches( teams );
Then we can execute the following SELECT
statenments:
SELECT * FROM so.matches WHERE teams CONTAINS {1,3,5,7,9};
SELECT * FROM so.matches WHERE teams CONTAINS {1,3,5,7,9} AND dt=toDate(now());
The first one would select the matches in which any of the teams selected that composition and the second one will further filter it to today's matches.
With these two tables you can hold all the info, and then request the data you need to calculate the stats involved. Once you calculate some data, you could store this info back in Cassandra as a "cache" in an additional table so that when a user requests some stats to be shown, you first check if they were already calculated and if they weren't calculate. This table would need to have a column for each parameter that the user can enter, for example: champion composition, starting date, final date, enemy team; and additional columns for the stats themselves.
CREATE TABLE so.stats (
team frozen<set<smallint>>,
s_ts timestamp,
e_ts timestamp,
enemy frozen<set<smallint>>,
win_ratio float,
loose_ratio float,
wins int,
draws int,
looses int,
PRIMARY KEY(team, s_ts, e_ts, enemy)
) WITH comment="Already calculated queries";
To get the results order by ratios instead of enemy team you can use materialized views.
CREATE MATERIALIZED VIEW so.statsByWinRatio AS
SELECT * FROM so.stats
WHERE team IS NOT NULL AND s_ts IS NOT NULL AND e_ts IS NOT NULL AND win_ratio IS NOT NULL AND enemy IS NOT NULL
PRIMARY KEY(team, s_ts, e_ts, win_ratio, enemy)
WITH comment='Allow ordering by win ratio';
NOTE:
While I was answering I realized that introducing the concept of "patch" inside the DB so that the user is not allowed to determine dates but patches could be a better solution. If you are interested comment and I'll edit the answer to include the patch concept. It would mean modifying both the so.historic
and so.stats
tables a bit, but quite minor changes.