There are several roll-your-own strategies for secondary indexes that handle concurrent updates, for example this one:
http://www.slideshare.net/edanuff/indexing-in-cassandra
which uses 3 ColumnFamilies.
My question is: how is the PlayORM @NoSqlIndexed annotation implemented, in terms of what extra ColumnFamilies are needed or created?
Additionally, are concurrent updates supported? That is, is it guaranteed that two competing updates cannot leave the index updated from one and the table updated from the other?
You can do concurrent updates with no locking.
The question on slide 46, "Can't I get a false positive?", applies to PlayOrm in the same way.
The one caveat is that you may need to resolve on read. Here is an example. Say Fred is in the database with an address of 123.
Now two servers update Fred at the same time, one changing his address to 456 and the other changing it to 789.
This means your index may end up containing both 456.fred and 789.fred, even though Fred's row holds only one of those addresses. You can resolve this on read, because the query WILL return Fred when you ask for people with address 456, and you can then check that against his row. There is another ticket open for us to resolve this on reads for you ;) and eliminate the stale entry.
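To make the race concrete, here is a minimal sketch in Python. It is a toy in-memory model, not PlayOrm's code or API: `rows`, `index`, `update_address` and `find_by_address` are all hypothetical names, and the index is assumed to be a set of (value, row key) entries. It shows two lock-free writers leaving two index entries behind, and a read that filters out (and cleans up) the stale one.

```python
# Toy in-memory stand-ins for a row store and an index (hypothetical, not PlayOrm's API).
rows = {"fred": {"address": "123"}}
index = {("123", "fred")}  # index entries are (indexed value, row key) pairs


def update_address(key, old_value, new_value):
    """Blind write: drop the entry for the value this writer read, add the new one."""
    index.discard((old_value, key))
    index.add((new_value, key))
    rows[key]["address"] = new_value  # last writer wins on the row itself


def find_by_address(value):
    """Query the index, then resolve on read by checking each hit against its row."""
    hits = []
    for indexed_value, key in sorted(index):  # sorted() copies, so discarding below is safe
        if indexed_value != value:
            continue
        if rows[key]["address"] == value:
            hits.append(key)
        else:
            index.discard((indexed_value, key))  # stale entry: clean it up
    return hits


# Both servers read Fred while his address was still 123, then write without locking.
seen_by_server_a = rows["fred"]["address"]
seen_by_server_b = rows["fred"]["address"]
update_address("fred", seen_by_server_a, "456")
update_address("fred", seen_by_server_b, "789")

stale_index = set(index)  # snapshot: the index holds both 456.fred and 789.fred here
hits_456 = find_by_address("456")  # the false positive is detected and removed
hits_789 = find_by_address("789")
print(stale_index, hits_456, hits_789)
```

In this model the row is the source of truth, so the cost of the race is one extra row check on the rare read that hits a stale entry.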
We did ask for a change in Cassandra that would let us do a conditional write (add column 456.fred IF column 123.fred exists, otherwise fail). That would propagate a failure back to the loser (i.e. the last writer gets an exception). It would be nice, but I am not sure they will ever implement a feature like this.
BIG NOTE: Unlike CQL, the query is NOT sent to all nodes. It only puts load on the nodes that contain the index instead of on all 100 computers in the cluster, so it can scale better this way.
MORE DETAIL: Our indexes are ALMOST like what is shown on slide 27 of the presentation you linked. The format does not contain the 1, 2, 3 though. The index format is:
Indexes=
{"User_Keys_By_Last_Name":{
    {"adams","e5d…"}: null,
    {"alden","e80…"}: null,
    {"anderson","e5f…"}: null,
    {"anderson","e71…"}: null,
    {"doe","e78…"}: null,
    {"franks","e66…"}: null,
    …:…,
  }
}
This way, we avoid the read needed to find out whether to use 1, 2, 3, 4 or 5 for the second half of the column name. Instead we use the FK, which we know is unique, so we only have to do a write. Cassandra is all about resolving conflicts on read anyway, which is why the repair process exists. The approach is based on the fact that conflicts happen only a very low percentage of the time, so you just take the hit in those rare cases.
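A rough sketch of why the (value, FK) column name helps, again as a toy Python model rather than PlayOrm's implementation. The wide index row is assumed to be a sorted list of (value, row key) tuples, and `add_index_entry`, `keys_for` and the keys `k1`, `k2`, `k3` are made-up names for illustration.

```python
import bisect

# Hypothetical model of one wide index row: a sorted list of (value, row_key) column names.
index_row = []


def add_index_entry(value, row_key):
    """Write-only insert: the row key makes the column name unique, so no counter lookup is needed."""
    name = (value, row_key)
    pos = bisect.bisect_left(index_row, name)
    # Writing the same column name twice is a no-op, like overwriting a column.
    if pos == len(index_row) or index_row[pos] != name:
        index_row.insert(pos, name)


def keys_for(value):
    """Scan the sorted columns whose first component equals `value`."""
    start = bisect.bisect_left(index_row, (value,))  # (value,) sorts before every (value, key)
    keys = []
    for indexed_value, row_key in index_row[start:]:
        if indexed_value != value:
            break
        keys.append(row_key)
    return keys


add_index_entry("anderson", "k2")
add_index_entry("adams", "k1")
add_index_entry("anderson", "k3")
add_index_entry("anderson", "k2")  # repeated write, no duplicate column

anderson_keys = keys_for("anderson")
print(anderson_keys)
```

Two people named anderson never collide because their row keys differ, so neither writer has to read the index first to pick a free suffix.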
LASTLY, you can just use the command line tool to view the index! It streams results back in batches of about 200 columns each, so you could have 1 million entries and the tool will happily keep printing them until you hit ctrl-c.
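For what the batching might look like, here is a small Python sketch. It is only an assumption about the general paging pattern, not the tool's actual code: `fetch_slice` is a hypothetical stand-in for one slice query against the index row, and `stream_columns` keeps asking for the next batch after the last column name it saw.

```python
import bisect


def fetch_slice(sorted_columns, start_after, limit):
    """Stand-in for one slice query: up to `limit` columns strictly after `start_after`."""
    start = 0 if start_after is None else bisect.bisect_right(sorted_columns, start_after)
    return sorted_columns[start:start + limit]


def stream_columns(sorted_columns, batch_size=200):
    """Yield every column, fetching one batch at a time so memory use stays bounded."""
    last_seen = None
    while True:
        batch = fetch_slice(sorted_columns, last_seen, batch_size)
        if not batch:
            return
        yield from batch
        last_seen = batch[-1]  # resume the next slice after this column name


# 1000 made-up (value, row key) column names, already in sorted order.
columns = [("name%04d" % i, "key%d" % i) for i in range(1000)]
streamed = list(stream_columns(columns))
print(len(streamed))
```

Because it is a generator, a caller can stop consuming at any point, which is the programmatic equivalent of hitting ctrl-c.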
later, Dean