My use case is as follows: I have to store and query data from multiple third-party sources. The only predefined schema knowledge I have is that it consists of Entity-Attribute-Value tuples with an additional Source component: (E, A, V, S)
. I do not know in advance which attributes will be present (so using the actual attributes themselves as columns is a problematic). The combination (E,A,V)
must be unique, and I therefore thought using that as the composite key would be the best way to model this, so we get this:
CREATE TABLE t1 (
E text,
A text,
V text,
S text,
PRIMARY KEY(E, A, V)
);
This will enable me to ask queries like:
The plan is to create mirrors of this table with differently ordered compound keys so that I can also answer queries of the form:
and so on and so forth (effectively, the mirror tables play the role of indexes, and to get full indexing I will need 6 copies of effectively the same data - not sure about the scalability of that approach yet, but that's a separate question I guess).
So far so good, however, the part I'm struggling with is this: V
is in fact an object in its own right, with multiple properties. If this were a relational model, I would have V
be a foreign key field pointing to a relation that maps its id
to, for example, a type
field and a value
field. But getting rid of foreign keys (and the joins that go with them) is more or less the point of the BigTable approach I guess, so I'm looking for a way to incorporate this into my table t1
.
Of course, I could do something like this:
CREATE TABLE t1 (
E text,
A text,
V_id text,
S text,
V_type text,
V_value text,
PRIMARY KEY(E, A, V_id)
);
But the problem I see is that this fails to capture the (inverse-)functional relation between V
's id, type and value: with the above table, I could get, for example:
E | A | V_id | V_type | V_value
---+----+------+--------+--------
a1 | b1 | 1 | X | foo
a1 | b1 | 2 | X | foo
a1 | b2 | 1 | Y | bar
While I want to be able to ensure that given a V_id
, type and value are unique, and vice versa. I guess what I'm after is what in the older Cassandra versions would be a nested super column, but I'm trying to achieve what I need with CQL3.
I briefly looked into collection types, but that doesn't quite seem to fit my use case.
Can anyone suggest a better way to model this, keeping in mind that ideally, I'd like to be able to retrieve (E, A, V)
and in as few queries as possible? Or am I just overthinking it and is the approach as I currently have it in fact fine (I can of course try and ensure uniqueness at the application level)?
Jeen, consider using map for V. See http://www.datastax.com/docs/1.2/cql_cli/using/collections Map, list, set types were introduced to avoid situations like encoding/decoding key-value pairs into and from string/byte arrays.