Tags: database-design, architecture, duplicates, hbase, record-linkage

Data Architecture: Deduplication of product catalogs


I'm thinking through my strategy for merging (and de-duplicating) multiple catalogs of products.

I'll be using a NoSQL database, and I need to query N catalogs of partially overlapping products.

Certain aspects, such as categorization, tags, and descriptions, need to be normalized, and I need to track which catalogs contain each unique item (de-duplicating the products in each catalog by UPC, for example).

My current thought is to import the individual catalogs into their own tables, then use self-built algorithms to identify "similar" items and perform normalization, and finally create a "Master" table containing the normalized, de-duplicated data. Each master record's values would be copied from whichever catalog (or mix of catalogs) it's chosen from, and the record would link back to the catalogs that contain that item.
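For concreteness, here's a minimal sketch of the kind of match key I have in mind for the UPC-based de-duplication (the helper name and the pad-to-GTIN-14 rule are my own assumptions, not settled design):

    // Hypothetical helper: collapse the various UPC/EAN formats seen across
    // catalogs into one canonical key so "similar" items group together.
    static String normalizeUpc(String raw) {
        if (raw == null) return null;
        String digits = raw.replaceAll("\\D", "");      // drop dashes, spaces, etc.
        if (digits.isEmpty() || digits.length() > 14) return null;
        // Left-pad with zeros to GTIN-14 so UPC-A (12) and EAN-13 compare equal.
        return ("00000000000000" + digits).substring(digits.length());
    }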

I wonder what other thoughts exist on the subject? What areas of research should I look into to better educate myself?


Solution

  • You didn't supply a lot of details, but from what I understand, if you'll be using HBase you can do the following:

    1. write all the data into HBase in its original format, or close to it
    2. write a MapReduce job to sort things out (see the sketch after this list):

      2.1. in the map phase, normalize the records and emit the potential match keys

      2.2. in the reduce phase (where you get all the records with the same key), produce the master record

    3. export the master records to wherever you'd like
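A rough sketch of what such a job could look like, assuming HBase's TableMapper/TableReducer API. The table names ("raw_catalogs", "master_products"), the column layout, the "catalog|sku" row-key convention, and the naive first-value-wins merge policy are all placeholders for illustration:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.mapreduce.TableReducer;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;

    public class CatalogDedup {

        static final byte[] RAW = Bytes.toBytes("p");     // raw-catalog column family
        static final byte[] MASTER = Bytes.toBytes("m");  // master-table column family

        // Step 2.1: normalize each raw record and emit its potential match key
        // (a cleaned-up UPC), so duplicates from different catalogs meet in reduce.
        static class NormalizeMapper extends TableMapper<Text, Text> {
            @Override
            protected void map(ImmutableBytesWritable row, Result value, Context ctx)
                    throws IOException, InterruptedException {
                byte[] upc = value.getValue(RAW, Bytes.toBytes("upc"));
                byte[] name = value.getValue(RAW, Bytes.toBytes("name"));
                if (upc == null) return;   // no usable key; handle these separately
                String key = Bytes.toString(upc).replaceAll("\\D", "");
                String rowKey = Bytes.toString(row.get(), row.getOffset(), row.getLength());
                String catalog = rowKey.split("\\|")[0];   // assumed "catalog|sku" row keys
                String n = name == null ? "" : Bytes.toString(name);
                ctx.write(new Text(key), new Text(catalog + "\t" + n));
            }
        }

        // Step 2.2: all records sharing a key arrive together; merge them into one
        // master record and remember which catalogs carried the item.
        static class MasterReducer extends TableReducer<Text, Text, ImmutableBytesWritable> {
            @Override
            protected void reduce(Text key, Iterable<Text> values, Context ctx)
                    throws IOException, InterruptedException {
                StringBuilder catalogs = new StringBuilder();
                String chosenName = null;
                for (Text v : values) {
                    String[] parts = v.toString().split("\t", 2);
                    if (catalogs.length() > 0) catalogs.append(',');
                    catalogs.append(parts[0]);
                    if (chosenName == null) chosenName = parts[1];   // naive: first wins
                }
                Put put = new Put(Bytes.toBytes(key.toString()));
                put.addColumn(MASTER, Bytes.toBytes("name"), Bytes.toBytes(chosenName));
                put.addColumn(MASTER, Bytes.toBytes("catalogs"), Bytes.toBytes(catalogs.toString()));
                ctx.write(new ImmutableBytesWritable(put.getRow()), put);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = Job.getInstance(conf, "catalog-dedup");
            job.setJarByClass(CatalogDedup.class);
            Scan scan = new Scan();
            scan.setCaching(500);          // stream rows in larger batches
            scan.setCacheBlocks(false);    // don't pollute the block cache from MR
            TableMapReduceUtil.initTableMapperJob("raw_catalogs", scan,
                    NormalizeMapper.class, Text.class, Text.class, job);
            TableMapReduceUtil.initTableReducerJob("master_products", MasterReducer.class, job);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The key design point is that the grouping-by-key work is free in MapReduce: once the mapper emits a normalized key, the framework delivers all candidate duplicates to a single reduce call, so your merge logic (here just "first value wins") only ever has to look at one product's worth of records at a time.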