I have about 2 million records, each with about 4 string fields, which need to be checked for duplicates. More specifically, the fields are name, phone, address and fathername, and I must dedupe each record against the rest of the data using all of these fields. The resulting unique records need to be written to the db.
I have been able to implement a mapreduce that iterates over all records. The task rate is set to 100/s and the bucket size to 100. Billing is enabled.
Currently everything works, but performance is very slow: in 6 hours I have completed dedupe processing for only 1,000 records of a 10,000-record test dataset.
The current design in Java is:
I am ready to increase GAE resources by any amount to achieve this in the shortest time.
My Questions are:
Freelancers are most welcome to assist with this.
Thanks for your help.
I see 2 ways to approach this problem:
(If you only need to do it once.) App Engine creates a property index for every property in your entity (unless you ask it not to). Create a backend and run a query "SELECT * FROM ORDER BY " in batches using cursors. Because the results come back sorted, entities with the same property value are adjacent, so you can detect the duplicates and fix or delete them. You might be able to parallelize this, but it's tricky at shard boundaries and you will probably have to write all the code yourself.
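A minimal sketch of that sorted scan, in plain Java with an in-memory list standing in for the Datastore query. It does not use the App Engine API: the class `SortedScanDedupe`, the `Person` type, the `dedupeKey()` helper and the integer `cursor` are all hypothetical names for illustration. It also assumes the four fields are joined into one exact-match key, which in a real datastore would have to be stored as its own indexed property to sort on. The point it shows is that the last key seen must be carried across batch boundaries.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** In-memory stand-in for a sorted, cursor-paged query (not the App Engine API). */
final class SortedScanDedupe {

    /** One record; field names follow the question. */
    static final class Person {
        final long id;
        final String name, phone, address, fatherName;

        Person(long id, String name, String phone, String address, String fatherName) {
            this.id = id;
            this.name = name;
            this.phone = phone;
            this.address = address;
            this.fatherName = fatherName;
        }

        /** Assumed composite key: exact match on all four fields, no normalization. */
        String dedupeKey() {
            return String.join("\u001F", name, phone, address, fatherName);
        }
    }

    /** Scans a list already sorted by dedupeKey() in batches; returns ids of duplicates. */
    static List<Long> findDuplicateIds(List<Person> sortedByKey, int batchSize) {
        List<Long> duplicates = new ArrayList<>();
        String previousKey = null; // carried across batches so boundary duplicates are caught
        int cursor = 0;            // stands in for a query cursor
        while (cursor < sortedByKey.size()) {
            int end = Math.min(cursor + batchSize, sortedByKey.size());
            for (Person p : sortedByKey.subList(cursor, end)) {
                String key = p.dedupeKey();
                if (key.equals(previousKey)) {
                    duplicates.add(p.id); // same key as the record before it
                } else {
                    previousKey = key;    // first record with this key is kept
                }
            }
            cursor = end;
        }
        return duplicates;
    }

    public static void main(String[] args) {
        List<Person> people = new ArrayList<>();
        people.add(new Person(1, "Ann", "555-0100", "1 Main St", "Bob"));
        people.add(new Person(2, "Cid", "555-0199", "2 Oak Ave", "Dan"));
        people.add(new Person(3, "Ann", "555-0100", "1 Main St", "Bob"));
        // Stands in for the ORDER BY clause of the query.
        people.sort(Comparator.comparing(Person::dedupeKey));
        System.out.println(findDuplicateIds(people, 2));
    }
}
```

If the scan were split across parallel workers, each worker would also need the last key of the preceding shard, which is the shard-boundary difficulty mentioned above.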
You can use the mapper framework. This is slower per record, but it runs in parallel. This approach also lets you efficiently dedupe data on insert. Introduce a new entity to hold unique property values, say "UniquePhoneNumber". The entity should use the phone number as its key and hold a reference to the entity with this phone number. Now run a map and, for each record, do a lookup for its UniquePhoneNumber. If one is found and its reference is valid, delete the duplicate. If not, create a new one with the correct reference. This way it's even possible to repoint a reference to a different entity if you need to. Make sure that you read the UniquePhoneNumber and create or update it inside a single transaction. Otherwise duplicates won't be detected.
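The lookup-then-claim logic can be sketched in plain Java as follows. This is an in-memory illustration under stated assumptions, not App Engine code: two `ConcurrentHashMap`s stand in for the record kind and the "UniquePhoneNumber" kind, and `ConcurrentHashMap.compute`, which runs atomically per key, stands in for the Datastore transaction. The class `UniqueKeyDedupe` and its method names are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** In-memory stand-in for the "UniquePhoneNumber" approach (not the App Engine API). */
final class UniqueKeyDedupe {
    // Stand-in for the record kind: record id -> phone number.
    private final Map<Long, String> records = new ConcurrentHashMap<>();
    // Stand-in for "UniquePhoneNumber": phone number (the key) -> id of the owning record.
    private final ConcurrentHashMap<String, Long> uniquePhone = new ConcurrentHashMap<>();

    void insert(long id, String phone) {
        records.put(id, phone);
    }

    int size() {
        return records.size();
    }

    Long ownerOf(String phone) {
        return uniquePhone.get(phone);
    }

    /** Map step for one record; returns true if it was deleted as a duplicate. */
    boolean mapRecord(long id) {
        String phone = records.get(id);
        if (phone == null) {
            return false; // record no longer exists
        }
        boolean[] duplicate = {false};
        // compute() is atomic per key: the read and the create/update happen together,
        // which is the role the single transaction plays in the text above.
        uniquePhone.compute(phone, (key, ownerId) -> {
            if (ownerId != null && ownerId != id && records.containsKey(ownerId)) {
                duplicate[0] = true; // a valid reference to another record exists
                return ownerId;      // keep the existing owner
            }
            return id;               // missing or stale reference: point it at this record
        });
        if (duplicate[0]) {
            records.remove(id);      // delete the duplicate
        }
        return duplicate[0];
    }

    public static void main(String[] args) {
        UniqueKeyDedupe dedupe = new UniqueKeyDedupe();
        dedupe.insert(1L, "555-0100");
        dedupe.insert(2L, "555-0100");
        dedupe.insert(3L, "555-0199");
        for (long id = 1; id <= 3; id++) {
            System.out.println(id + " duplicate? " + dedupe.mapRecord(id));
        }
    }
}
```

Because each record only touches the entry for its own phone number, map workers can process different records at the same time, and the same `mapRecord` check can be run when a new record is inserted.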