Problem
I have two lists of objects. Each object contains the following:
GUID
(allows to determine if objects are the same — from business
point of view)Timestamp
(updates to current UTC each time the
object changed)Version
(positive integer; increments each time
the object changed)Deleted
(boolean flag; switches to "true" instead
of actual object deleting)Data
(some useful payload)Next, I need to sync two lists according to these rules:
GUID
presented only in one list, it should be copied to another listGUID
presented in both lists, the instance with less Version
should be replaced with one having greater Version
(nothing to do if versions are equal)Real-world requirements:
Solution #1 (currently implemented)
T
)Timestamp
> T
(i.e. recently modified; in the production it's ... > (T
- day) for better robustness)Version
's compared and saved to appropriative list (if need)Procs:
Cons:
T
, which makes the algorithm fragile: it's easy to sync last updates, but hard to make sure lists are completely synced (using minimal T
like 1970-01-01 just hangs the sync process)My questions:
P.S. Already viewed, not duplicates:
All answers has some worth points. To summarize, here is the compiled answer I was looking for, based on finally implemented working sync system:
In general, use Merkle trees. They are dramatically efficient in comparing large amounts of data.
If you can, rebuild your hash tree from scratch every time you need it. Check the time required to rebuild hash tree. Most likely it's pretty fast (e.g., in my case on Nexus 4 rebuilding tree for 20k items takes ~2 sec: 1.8 sec for fetching data from DB + 0.2 sec for building tree; the server performs ~20x faster), so you don't need to store the tree in the DB and maintain it when data changed (my first try was rebuilding only relevant branches — it's not too complicated to implement, but is very fragile).
Nevertheless, it's ok to cache and reuse tree if no data modifications was done at all. Once modification happened, invalidate the whole cache.
0000
→ (hash of items with GUID 0000*
)0001
→ (hash of items with GUID 0001*
)ffff
→ (hash of items with GUID ffff*
);000
→ (hash of hashes 000_
)00
→ (hash of hashes 00_
)_
)Thus, the tree has 65536 leafs and requires 2 Mb of memory; each leaf covers ~N/65536 data items. Binary trees would be 2x more efficient in terms of memory, but it's a harder to implement.
I had to implement these methods:
getHash()
— returns root hash; used for primary check (as mentioned,
in 96% that's all we need to test);getHashChildren(x)
— returns list of hashes x_
(at most 16); used for effective, single-request discovering data difference;findByIdPrefix(x)
— returns items with GUID x*
, x must contain exactly 4 chars; used for requesting leaf items;count(x)
— returns number of items with GUID x*
; when reasonably small, we can dismiss checking tree branch-by-branch and transfer bunch of items with single request;As far as syncing is done per-branch transmitting small amounts of data, it's very responsive (you can check the progress at any time) + very robust for unexpected terminating (e.g., due to network failure) and easily restarts from the last point if need.
version_1
= version_2
, but hash_1
!= hash_2
}: in this case you must make some decision (maybe with user's help or comparing timestamps as last resort) and rewrite some item with another to resolve the conflict, otherwise you'll end up with unsynced and unsyncable hash trees.(GUID, Version)
pairs without payload for lightweighting requests.