I'm using MySQL 5.6. Let's say we have the following two tables:
Every DataSet has a huge number of child DataEntry records: 10,000, 100,000, or more. DataSet.md5sum and DataSet.version are updated whenever its child DataEntry records are inserted or deleted, in the same transaction. DataSet.md5sum is calculated over the DataEntry.content values of all its children.
Given this, what's the most efficient way to fetch consistent data from those two tables?
If I issue the following two distinct SELECTs, I think I might get inconsistent data due to concurrent INSERT / UPDATEs:
SELECT md5sum, version FROM DataSet WHERE dataset_id = 1000
SELECT dataentry_id, content FROM DataEntry WHERE dataset_id = 1000
-- I think the result of this query may be inconsistent with the md5sum fetched by the former query
I think I can get consistent data with one query, as follows:
SELECT e.dataentry_id, e.content, s.md5sum, s.version
FROM DataSet s
INNER JOIN DataEntry e ON (s.dataset_id = e.dataset_id)
WHERE s.dataset_id = 1000
But it produces a redundant result set filled with 10,000 or 100,000 duplicated md5sums, so I guess it's not efficient (EDIT: My concerns are high network bandwidth and memory consumption).
I think using a pessimistic read/write lock (SELECT ... LOCK IN SHARE MODE / FOR UPDATE) would be another option, but it seems like overkill. Are there any better approaches?
The join ensures that the returned data is not affected by any updates that might occur between two separate selects, since both tables are read in a single query.
When you say that md5sum and version are updated, do you mean the child table has a trigger on it for inserts and updates?
When you join the tables, you will get a duplicated md5sum and version on every row because you are pulling the matching DataSet record for each item in the DataEntry table. That is perfectly fine and isn't going to be an efficiency issue. The alternative would be to use the two individual selects, but, depending on the frequency of inserts/updates, without a transaction you run a very slight risk of getting an md5sum that does not match the entries you fetched.
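If the duplicated columns are a concern, the two-select alternative can be wrapped in a single transaction. The following is a minimal sketch, not a tested recipe. It assumes both tables use InnoDB and that the session runs at REPEATABLE READ (the InnoDB default), in which case both plain SELECTs read from the same snapshot without taking row locks.

```sql
-- Assumption: DataSet and DataEntry are InnoDB tables.
-- Applies to the next transaction only; REPEATABLE READ is already the InnoDB default.
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;

-- Establish the read snapshot now rather than at the first SELECT.
START TRANSACTION WITH CONSISTENT SNAPSHOT;

-- Both reads see the database as of the same point in time.
SELECT md5sum, version FROM DataSet WHERE dataset_id = 1000;
SELECT dataentry_id, content FROM DataEntry WHERE dataset_id = 1000;

COMMIT;
```

Because the writer changes DataSet and DataEntry in one transaction, a snapshot taken this way should contain either all of a given write or none of it, and the md5sum is sent over the network only once.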
I would just go with the join. You can run EXPLAIN on your queries from within MySQL to see how each one is executed and to compare the two approaches against your own data and whatever indexes you have.
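As a quick illustration, prefixing each statement from the question with EXPLAIN shows the plan MySQL would choose. Whether an index on DataEntry.dataset_id exists is not stated in the question, so the output will depend on your schema.

```sql
-- Plan for the single-query join
EXPLAIN
SELECT e.dataentry_id, e.content, s.md5sum, s.version
FROM DataSet s
INNER JOIN DataEntry e ON (s.dataset_id = e.dataset_id)
WHERE s.dataset_id = 1000;

-- Plan for the child-table select on its own
EXPLAIN
SELECT dataentry_id, content FROM DataEntry WHERE dataset_id = 1000;
```

In the output, the type, key, and rows columns indicate whether an index is used and roughly how many rows MySQL expects to examine.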
Perhaps it would be more beneficial to copy these groups of records into a staging table of sorts. Before processing, you could call a pre-processor function that takes a "snapshot" of the data about to be processed and puts a copy into a staging table. Then you could select just the version and md5sum, and then all of the records, as two different selects. Since the rows have been copied into a separate staging table, you won't have to worry about concurrent updates corrupting your processing session. You could set up timed jobs to do this or expose it as an on-demand call. Again, though, you would need to research the best approach given the hardware/network setup you are working with and any job scheduling software you have available.
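A rough sketch of the staging idea might look like the following. The names DataSetStaging and DataEntryStaging are hypothetical, and the sketch assumes InnoDB tables and a single process refreshing the staging copy at a time.

```sql
-- Hypothetical staging tables with the same columns and indexes as the originals.
CREATE TABLE IF NOT EXISTS DataSetStaging LIKE DataSet;
CREATE TABLE IF NOT EXISTS DataEntryStaging LIKE DataEntry;

-- Refresh the copy of one data set in a single transaction.
START TRANSACTION;
DELETE FROM DataEntryStaging WHERE dataset_id = 1000;
DELETE FROM DataSetStaging WHERE dataset_id = 1000;
INSERT INTO DataSetStaging SELECT * FROM DataSet WHERE dataset_id = 1000;
INSERT INTO DataEntryStaging SELECT * FROM DataEntry WHERE dataset_id = 1000;
COMMIT;

-- Later reads hit only the staging copy, as two separate selects.
SELECT md5sum, version FROM DataSetStaging WHERE dataset_id = 1000;
SELECT dataentry_id, content FROM DataEntryStaging WHERE dataset_id = 1000;
```

One caveat: under REPEATABLE READ, InnoDB reads the source rows of an INSERT ... SELECT with shared locks rather than from a snapshot, so the copy step can block concurrent writers, be blocked by them, or occasionally hit a deadlock that needs a retry.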