Search code examples
databasearchitectureduplicatessystem-designapplication-design

Customer Deduplication in Booking Application


We have a booking system where dozens of thousands of reservations are done every day. Because a customer can create a reservation without being logged in, it means that for every reservation a new customer id/row is created, even if the very same customer already have reserved in the system before. That results in a lot of customer duplicates.

The engineering team has decided that, in order to deduplicate the customers, they will run a nightly script, every day, which checks for this duplicates based on some business rules (email, address, etc). The logic for the deduplication then is:

  • If a new reservation is created, check if the (newly created) customer for this reservation has already an old customer id (by comparing email and other aspects).
  • If it has one or more old reservations, detach that reservation from the old customer id, and link it to a new customer id. Literally by changing the customer ID of that old reservation to the newly created customer.

I don't have a too strong technical background but this for me smells like terrible design. As we have several operational applications relying on that data, this creates a massive sync issue. Besides that, I was hoping to understand why exactly, in terms of application architecture, this is bad design and what would be a better solution for this problem of deduplication (if it even has to be solved in "this" application domain).

I would appreciate very much any help so I can drive the engineering team to the right direction.


Solution

  • In General

    What's the problem you're trying to solve? Free-up disk space, get accurate analytics of user behavior or be more user friendly?

    It feels a bit risky, and depends on how critical it is that you get the re-matching 100% correct. You need to ask "what's the worst that can happen?" and "does this open the system to abuse" - not because you should be paranoid, but because to not think that through feels a bit negligent. E.g. if you were a govt department matching private citizen records then that approach would be way too cavalier.

    If the worst that can happen is not so bad, and the 80% you get right gets you the outcome you need, then maybe it's ok.

    If there's not a process for validating the identity of the user then by definition your customer id/row is storing sessions, not Customers.

    In terms of the nightly job - If your backend system is an old legacy system then I can appreciate why a nightly batch job might be the easiest option; that said, if done correctly and with the right architecture, you should be able to do that check on the fly as needed.

    Specifics

    ...check if the (newly created) customer for this reservation has already an old customer id (by comparing email...

    Are you validating the email - e.g. by getting users to confirm it through a confirmation email mechanism? If yes, and if email is a mandatory field, then this feels ok, and you could probably use the email exclusively.

    ... and other aspects.

    What are those? Sometimes getting more data just makes it harder unless there's good data hygiene in place. E.g. what happens if you're checking phone numbers (and other data) and someone does a typo on the phone number which matches with some other customer - so you simultaneously match with more than one customer?

    If it has one or more old reservations, detach that reservation from the old customer id, and link it to a new customer id. Literally by changing the customer ID of that old reservation to the newly created customer.

    Feels dangerous. What happens if the detaching process screws up? I've seen situations where instead of updating the delta, the system did a total purge then full re-import... when the second part fails the entire system is blank. It's not your exact situation but you are creating the possibility for similar types of issue.

    As we have several operational applications relying on that data, this creates a massive sync issue.

    ...case in point.

    In your case, doing the swap in a transaction would be wise. You may want to consider tracking all Cust ID swaps so that you can revert if something goes wrong.

    Option - Phased Introduction Based on Testing

    You could try this:

    1. Keep the system as-is for now.
    2. Add the logic which does the checks you are proposing, but have it create trial data on the side - i.e. don't change the real records, just make a copy that is what the new data would be. Do this in production - you'll get a way better sample of data.
    3. Run extensive tests over the trial data, looking for instances where you got it wrong. What's more likely, and what you could consider building, is a "scoring" algorithm. If you are checking more than one piece of data then you'll get different combinations with different likelihood of accuracy. You can use this to gauge how good your matching is. You can then decide in which circumstances it's safe to do the ID switch and when it's not.
    4. Once you're happy, implement as you see fit - either just the algorithm & result, or the scoring harness as well so you can observe its performance over time - especially if you introduce changes.

    Alternative Customer/Session Approach

    1. Treat all bookings (excluding personal details) as bookings, with customers (little c, i.e. Sessions) but without Customers.
    2. Allow users to optionally be validated as "Customers" (big C).
    3. Bookings created by a validated Customer then link to each other. All bookings relate to a customer (session) which never changes, so you have traceability.

    I can tweak the answer once I know more about what problem it is you are trying to solve - i.e. what your motivations are.