Tags: python, sqlalchemy, outer-join

Complex join with nested subqueries in SQLAlchemy - find the closest date match


We have two tables that share a global_id to join on; call them device_updates and client_updates. Their schemas (reduced here to the columns relevant to the problem) look like the following:

device_updates

update_id  global_id  created   device_date
Integer    Integer    Datetime  Datetime

client_updates

client_update_id global_id updated updated_by update_type
Integer Integer Datetime Integer String

*Note that the real table stores update_type as an Integer; a String is shown here for clarity.
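For reference, here is a minimal sketch of how the two tables might be declared with the SQLAlchemy ORM, assuming the column names and types above (the class names mirror how they are used in the query further down, even though PEP 8 would prefer CamelCase):

from sqlalchemy import Column, DateTime, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class device_updates(Base):
    __tablename__ = 'device_updates'
    update_id = Column(Integer, primary_key=True)
    global_id = Column(Integer, index=True)   # shared id linking the two tables
    created = Column(DateTime)                # when the row was written
    device_date = Column(DateTime)            # when the device changed state

class client_updates(Base):
    __tablename__ = 'client_updates'
    client_update_id = Column(Integer, primary_key=True)
    global_id = Column(Integer, index=True)
    updated = Column(DateTime)                # when the client acted
    updated_by = Column(Integer)              # id of the acting client
    update_type = Column(String)              # e.g. "Silence" (an Integer in the real table)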

When a device changes state, it always writes a row to the device_updates table. If the new state is an alarm, the client is notified and can choose to 'update' that alarm, for example by silencing it.

What we need to be able to do is generate a history of all events, from both device updates and client updates, in chronological order. But since neither table is aware of the other (beyond the shared global_id), I can't figure out how to ensure that each returned row contains the exact state at that point in time. Adding a column to client_updates that references the matching device_updates row is not an option either: a few thousand devices can update every second, a client can silence all of them with a single click, and because we plan to use bulk inserts to meet those speed requirements there is no opportunity to assign a matched id per row.

As an example flow:

  1. device with global_id 15 changes state
  2. client 7 updates this to silence it a few seconds later
  3. device with global_id 15 changes state again at some time in the future

What should be returned in an 'event_history' query is the following:

update_id  global_id  created   device_date  client_update_id  updated   updated_by  update_type
1          15         10:53:24  10:53:23     Null              Null      Null        Null
2          15         10:53:24  10:53:23     1                 10:53:34  7           "Silence"
3          15         10:57:09  10:57:08     Null              Null      Null        Null

To sum up: how do I join the device_updates table to the client_updates table, given that a device update won't necessarily have an associated client update, but every client update needs to find the exact device update that matches its updated time (the client update always occurs after the device_date)? Note also that there are potentially many thousands of devices, so this is not just for a single global_id.
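To make the matching rule concrete, here is a small pure-Python sketch of the intended semantics over in-memory rows (dictionaries standing in for table rows; purely illustrative, not the database query we're after):

def build_event_history(device_rows, client_rows):
    # Every device update is an event in its own right.
    history = [
        {**d, 'client_update_id': None, 'updated': None,
         'updated_by': None, 'update_type': None}
        for d in device_rows
    ]
    # Each client update is paired with the newest device update for the
    # same global_id whose device_date precedes the client's updated time.
    for c in client_rows:
        candidates = [
            d for d in device_rows
            if d['global_id'] == c['global_id'] and d['device_date'] < c['updated']
        ]
        match = max(candidates, key=lambda d: d['device_date'], default=None)
        if match is not None:
            history.append({**match, **c})
    # Chronological order: client events sort by their updated time,
    # plain device events by their device_date.
    history.sort(key=lambda r: r['updated'] or r['device_date'])
    return history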

UPDATE: I forgot to add an example of a query I have tried, which is below:

db.query(
  client_updates,
  device_updates
).outerjoin(
  device_updates,
  # Intended: match each client update to the newest earlier device update.
  device_updates.update_id==db.query(
    device_updates
  ).filter(
    device_updates.global_id==client_updates.global_id,
    device_updates.device_date<client_updates.updated
  ).order_by(
    device_updates.device_date.desc()
  ).first().update_id
)

But this appears to evaluate the subquery once and then match every row against the same update_id.
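For what it's worth, the likely reason is that .first() executes the inner query once, in Python, while the outer query is still being constructed, so its single result is baked into the join condition as a constant. Here is a hedged sketch of keeping the lookup inside the database with a correlated scalar subquery instead (SQLAlchemy 1.4+ method names assumed):

from sqlalchemy.orm import aliased

# Alias so the subquery's device_updates is distinct from the joined one.
du = aliased(device_updates)

# Evaluated by the database once per client_updates row, not once in Python.
closest_update_id = (
    db.query(du.update_id)
    .filter(
        du.global_id == client_updates.global_id,
        du.device_date < client_updates.updated,
    )
    .order_by(du.device_date.desc())
    .limit(1)
    .correlate(client_updates)
    .scalar_subquery()
)

history = db.query(client_updates, device_updates).outerjoin(
    device_updates, device_updates.update_id == closest_update_id
)

On PostgreSQL, a LATERAL join can express the same per-row lookup and may plan better at this scale.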


Solution

  • Ended up going with the Kafka solution. Super speedy results: my test build comfortably handled a very high throughput, and I now have confidence in it being robust and fast.
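For context, here is a minimal sketch of what the producer side of such a Kafka pipeline could look like, using the kafka-python package; the topic name, broker address, and message layout are assumptions for illustration, not the author's actual setup. Keying messages by global_id keeps all events for one device in a single partition, so Kafka preserves their relative order and the consumer can build the chronological history directly.

import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',          # assumed broker address
    key_serializer=lambda k: str(k).encode(),
    value_serializer=lambda v: json.dumps(v, default=str).encode(),
)

def publish_device_update(update_id, global_id, device_date):
    # Keying by global_id routes all events for one device to one
    # partition, so their relative order is preserved.
    producer.send('update-events', key=global_id, value={
        'source': 'device', 'update_id': update_id,
        'global_id': global_id, 'device_date': device_date,
    })

def publish_client_update(client_update_id, global_id, updated, updated_by, update_type):
    producer.send('update-events', key=global_id, value={
        'source': 'client', 'client_update_id': client_update_id,
        'global_id': global_id, 'updated': updated,
        'updated_by': updated_by, 'update_type': update_type,
    })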