Search code examples
pythonjoincassandracql

Find IDs in one table not present in another table


Given are two tables in Cassandra:

xs (id bigint, x bigint)
0,10
1,11
2,12
3,13
4,14

ys (id bigint, y bigint)
0,100
2,144
4,196

In this minimal example I would like to get all rows from xs having an id not contained in ys yet, i.e.

select
1,11
3,13

How can I achieve this?

Based on the selected rows I want to create new rows in ys, basically synchronizing (addition only) ys to xs.

Here is my code to create the keyspaces/tables:

from cassandra.cluster import Cluster

cluster = Cluster()
session = cluster.connect()

session.execute('DROP KEYSPACE keyspace_x;')
session.execute('DROP KEYSPACE keyspace_y;')

session.execute('''
    CREATE KEYSPACE keyspace_x
    WITH replication = {
        'class' : 'SimpleStrategy',
        'replication_factor' : 1
    };''');

session.execute('''
    CREATE KEYSPACE keyspace_y
    WITH replication = {
        'class' : 'SimpleStrategy',
        'replication_factor' : 1
    };''');

session.execute('''CREATE TABLE keyspace_x.xs (
    id bigint,
    x bigint,
    PRIMARY KEY (id)
);''');

session.execute('''CREATE TABLE keyspace_y.ys (
    id bigint,
    y bigint,
    PRIMARY KEY (id)
);''');

session.execute('INSERT INTO keyspace_x.xs(id, x) VALUES (%s, %s)',(0, 10))
session.execute('INSERT INTO keyspace_x.xs(id, x) VALUES (%s, %s)',(1, 11))
session.execute('INSERT INTO keyspace_x.xs(id, x) VALUES (%s, %s)',(2, 12))
session.execute('INSERT INTO keyspace_x.xs(id, x) VALUES (%s, %s)',(3, 13))
session.execute('INSERT INTO keyspace_x.xs(id, x) VALUES (%s, %s)',(4, 14))

session.execute('INSERT INTO keyspace_y.ys(id, y) VALUES (%s, %s)',(0, 100))
session.execute('INSERT INTO keyspace_y.ys(id, y) VALUES (%s, %s)',(2, 144))
session.execute('INSERT INTO keyspace_y.ys(id, y) VALUES (%s, %s)',(4, 196))

In SQL I would use something like that

SELECT xs.id, xs.x FROM xs
LEFT JOIN ys ON xs.id = ys.id
WHERE ys.id IS NULL;

But what about CQL?

Currently I just fetch all IDs from both tables and do a set difference in my application, but I guess this is far from optimal.


Solution

  • There is no multi-table functionality. How implement LEFT or RIGHT JOIN using spark-cassandra-connector You can add relational layers onto Cassandra. How to do a join queries with 2 or more tables in cassandra cql You can fruitlessly search the manual. http://cassandra.apache.org/doc/latest/cql/dml.html Re browsing/googling: Other SQL ways to do what you want are EXCEPTaka MINUS & NOT IN. Algebraic functionality is DIFFERENCE aka MINUS & ANTIJOIN).