Tags: java, google-app-engine, hadoop, mapreduce, deduplication

Bad Performance for Dedupe of 2 million records using mapreduce on Appengine


I have about 2 million records, each with about 4 string fields that need to be checked for duplicates. To be more specific, the fields are name, phone, address and fathername, and I must check for duplicates on all of these fields against the rest of the data. The resulting unique records need to be written to the db.
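Since a record counts as a duplicate only when all four fields match, the comparison effectively reduces to equality of a single normalized key per record. A minimal sketch of that idea (the normalization rules and the helper name are illustrative assumptions, not my actual matching algorithm):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public final class DedupeKey {
        private DedupeKey() {}

        // Two records are considered duplicates iff their keys are equal.
        public static String of(String name, String phone, String address, String fathername) {
            String normalized = normalize(name) + "|" + normalize(phone) + "|"
                    + normalize(address) + "|" + normalize(fathername);
            return sha1Hex(normalized);
        }

        private static String normalize(String value) {
            // Trivial normalization for illustration; the real matching rules may be fuzzier.
            return value == null ? "" : value.trim().toLowerCase().replaceAll("\\s+", " ");
        }

        private static String sha1Hex(String input) {
            try {
                MessageDigest md = MessageDigest.getInstance("SHA-1");
                StringBuilder hex = new StringBuilder();
                for (byte b : md.digest(input.getBytes(StandardCharsets.UTF_8))) {
                    hex.append(String.format("%02x", b));
                }
                return hex.toString();
            } catch (NoSuchAlgorithmException e) {
                throw new IllegalStateException(e);
            }
        }
    }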

I have been able to implement MapReduce and iterate over all records. The task rate is set to 100/s and the bucket size to 100. Billing is enabled.

Currently, everything is working, but performance is very, very slow. In 6 hours I have only been able to complete dedupe processing of 1,000 records out of a test dataset of 10,000.

The current design in Java is as follows (a rough sketch of the map step is shown after the list):

  1. In every map iteration, I compare the current record with the previous record
  2. The previous record is a single record in the db that acts like a global variable; I overwrite it with another previous record in each map iteration
  3. The comparison is done using an algorithm and the result is written to the db as a new entity
  4. At the end of one MapReduce job, I programmatically create another job
  5. The previous-record variable lets the job compare the next candidate record with the rest of the data
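Roughly, the map step looks like the sketch below (the entity kinds, property names and the comparison helper are placeholders, and the real mapper signature from the mapreduce library is simplified):

    import com.google.appengine.api.datastore.DatastoreService;
    import com.google.appengine.api.datastore.DatastoreServiceFactory;
    import com.google.appengine.api.datastore.Entity;
    import com.google.appengine.api.datastore.EntityNotFoundException;
    import com.google.appengine.api.datastore.Key;
    import com.google.appengine.api.datastore.KeyFactory;

    public class SequentialDedupeMapper {
        private final DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
        // Single "previous record" entity that acts like a global variable.
        private final Key previousKey = KeyFactory.createKey("PreviousRecord", "current");

        // Called once per record by the mapper framework (signature simplified).
        public void map(Entity record) {
            try {
                Entity previous = datastore.get(previousKey);        // one read per record
                if (isDuplicate(previous, record)) {
                    Entity result = new Entity("DedupeResult");
                    result.setProperty("original", previous.getKey());
                    result.setProperty("duplicate", record.getKey());
                    datastore.put(result);                           // extra write when a dupe is found
                }
            } catch (EntityNotFoundException e) {
                // first record of the job: nothing to compare against yet
            }
            // Overwrite the global "previous record" with the current one (one more write per record).
            Entity previous = new Entity(previousKey);
            previous.setProperty("name", record.getProperty("name"));
            previous.setProperty("phone", record.getProperty("phone"));
            previous.setProperty("address", record.getProperty("address"));
            previous.setProperty("fathername", record.getProperty("fathername"));
            datastore.put(previous);
        }

        private boolean isDuplicate(Entity a, Entity b) {
            // placeholder: my real comparison algorithm is fuzzier than exact equality
            return eq(a, b, "name") && eq(a, b, "phone") && eq(a, b, "address") && eq(a, b, "fathername");
        }

        private boolean eq(Entity a, Entity b, String property) {
            Object value = a.getProperty(property);
            return value != null && value.equals(b.getProperty(property));
        }
    }

As the sketch shows, every record costs at least one get and one put against the same "previous record" entity, which I suspect is a big part of why it is so slow.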

I am ready to increase GAE resources by any amount to achieve this in the shortest time.

My Questions are:

  1. Will the accuracy of the dedupe (checking for duplicates) be affected by parallel jobs/tasks?
  2. How can this design be improved?
  3. Will this scale to 20 million records?
  4. What's the fastest way to read/write variables (not just counters) during a map iteration so they can be used across one MapReduce job?

Freelancers are most welcome to assist with this.

Thanks for your help.


Solution

  • I see 2 ways to approach this problem:

    1. (If you only need to do it once) App Engine creates a property index for every property in your entity (unless you ask it not to). Create a backend, run a query like "SELECT * FROM <YourEntity> ORDER BY <property>" in batches using cursors, find the duplicated property values, and fix/delete them. You might be able to parallelize this, but it's tricky at shard boundaries and you will probably have to write all the code yourself. (A cursor-based sketch of this appears after the list.)

    2. You can use the mapper framework to do it more slowly, but running in parallel. This approach also allows you to efficiently dedupe data on insert. Introduce a new entity to hold unique property values, say "UniquePhoneNumber". The entity should hold the phone number as its key and a reference to the entity with that phone number. Now run a map and, for each record, do a lookup for the UniquePhoneNumber. If it's found and its reference is valid, delete the duplicate. If not, create a new one with the correct reference. This way it's even possible to repoint a reference to the other entity if you need to. Make sure that you read the UniquePhoneNumber and create/update it inside a single transaction; otherwise duplicates won't be detected. (A sketch of this map step follows below.)
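For the first approach, a rough sketch of the cursor-driven scan. The "Person" kind and "phone" property are placeholders for your own entity; sorting on one property only surfaces duplicates of that property, so you would either repeat the scan per property or sort on a combined field:

    import com.google.appengine.api.datastore.Cursor;
    import com.google.appengine.api.datastore.DatastoreService;
    import com.google.appengine.api.datastore.DatastoreServiceFactory;
    import com.google.appengine.api.datastore.Entity;
    import com.google.appengine.api.datastore.FetchOptions;
    import com.google.appengine.api.datastore.PreparedQuery;
    import com.google.appengine.api.datastore.Query;
    import com.google.appengine.api.datastore.QueryResultList;

    public class CursorDedupeScan {
        private static final int BATCH_SIZE = 500;

        public void scan() {
            DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
            // Sorting by the property brings its duplicates next to each other.
            Query query = new Query("Person").addSort("phone", Query.SortDirection.ASCENDING);
            PreparedQuery prepared = datastore.prepare(query);

            FetchOptions options = FetchOptions.Builder.withLimit(BATCH_SIZE);
            String previousPhone = null;
            while (true) {
                QueryResultList<Entity> batch = prepared.asQueryResultList(options);
                for (Entity person : batch) {
                    String phone = (String) person.getProperty("phone");
                    if (phone != null && phone.equals(previousPhone)) {
                        datastore.delete(person.getKey());   // or mark/fix instead of deleting
                    }
                    previousPhone = phone;
                }
                if (batch.size() < BATCH_SIZE) {
                    break;                                   // no more results
                }
                Cursor cursor = batch.getCursor();
                options = FetchOptions.Builder.withLimit(BATCH_SIZE).startCursor(cursor);
            }
        }
    }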
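For the second approach, a sketch of the per-record lookup. The entity kind and property names follow the description above; everything else (method name, how the caller handles duplicates) is an assumption. Because the get-by-key and the create happen in one transaction, two parallel shards racing on the same phone number cannot both register it:

    import com.google.appengine.api.datastore.DatastoreService;
    import com.google.appengine.api.datastore.DatastoreServiceFactory;
    import com.google.appengine.api.datastore.Entity;
    import com.google.appengine.api.datastore.EntityNotFoundException;
    import com.google.appengine.api.datastore.Key;
    import com.google.appengine.api.datastore.KeyFactory;
    import com.google.appengine.api.datastore.Transaction;
    import java.util.ConcurrentModificationException;

    public class UniquePhoneNumberDedupe {
        private final DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();

        // Called from the map step; returns true if the record is a duplicate
        // (the caller can then delete or mark it).
        public boolean isDuplicate(Entity record) {
            String phone = (String) record.getProperty("phone");
            // The phone number itself is the key name, so the lookup is a cheap get-by-key.
            Key uniqueKey = KeyFactory.createKey("UniquePhoneNumber", phone);

            Transaction txn = datastore.beginTransaction();
            try {
                try {
                    Entity existing = datastore.get(txn, uniqueKey);
                    Key owner = (Key) existing.getProperty("record");
                    boolean duplicate = !record.getKey().equals(owner);
                    txn.commit();
                    return duplicate;          // some other record already owns this phone number
                } catch (EntityNotFoundException e) {
                    Entity unique = new Entity(uniqueKey);
                    unique.setProperty("record", record.getKey());
                    datastore.put(txn, unique);
                    txn.commit();              // read + create inside a single transaction
                    return false;
                }
            } catch (ConcurrentModificationException e) {
                if (txn.isActive()) {
                    txn.rollback();
                }
                return true;                   // a parallel task registered this number first; treat as duplicate or retry
            }
        }
    }

If you need to dedupe on all four fields rather than just the phone number, the same pattern works with a combined key (for example a hash of the normalized name, phone, address and fathername) as the key name.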