I'm thinking through my strategy for merging (and de-duplicating) multiple catalogs of products.
I'll be using a NoSQL database, and I need to query N catalogs of partially overlapping products.
Certain aspects, such as categorization, tags, and descriptions, need to be normalized, and I need to track which catalogs contain each unique item (de-duplicating the products in each catalog by UPC, for example).
My current thought is to import the individual catalogs into their own tables, use self-built algorithms to identify "similar" items, perform normalization, and then create a final "master" table containing the normalized and de-duplicated data. The master-record values would be copied from whichever catalog (or mix of catalogs) they are chosen from, and each master record would link back to the catalogs that contain that item.
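To make this concrete, here's a rough sketch (in Java) of the master record I have in mind; the field names and the naive "first value wins" merge policy are placeholders for illustration only:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical master record: one row per unique product, keyed by UPC,
// with normalized fields plus links back to every catalog that carries it.
public class MasterRecord {
    String upc;                                         // de-duplication key
    String normalizedTitle;                             // chosen/merged from the source catalogs
    String normalizedCategory;                          // mapped onto a shared taxonomy
    List<String> tags = new ArrayList<>();
    List<String> sourceCatalogIds = new ArrayList<>();  // which catalogs contain this item

    // Fold another catalog's entry for the same UPC into this master record.
    void merge(String catalogId, String title, String category) {
        sourceCatalogIds.add(catalogId);
        if (normalizedTitle == null) normalizedTitle = title;           // naive "first wins" policy
        if (normalizedCategory == null) normalizedCategory = category;
    }
}
```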
What other approaches exist for this kind of problem, and what areas of research should I look into to better educate myself?
You didn't supply a lot of details, but from what I understand, if you're using HBase you could do the following:
write a map/reduce job to sort things out (a rough sketch follows the list):
1. in the map phase, normalize each record and emit the potential keys
2. in the reduce phase (where you get all the records that share the same key), produce the master record
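Here's a minimal sketch of such a job, assuming plain tab-separated input lines of the form `catalogId<TAB>upc<TAB>title`; the UPC is used as the candidate key, and the "first title wins" pick in the reducer is just a placeholder for a real merge policy:

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CatalogMergeJob {

    // Map phase: normalize each raw catalog line and emit it under its
    // candidate de-dup key (the UPC). The input layout is an assumption --
    // adjust the field positions to your real data.
    public static class NormalizeMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length < 3) return;              // skip malformed rows
            String upc = fields[1].trim();              // candidate de-dup key
            String normalized = fields[0] + "\t" + fields[2].trim().toLowerCase();
            context.write(new Text(upc), new Text(normalized));
        }
    }

    // Reduce phase: all records sharing a UPC arrive together; fold them
    // into one master record that remembers every source catalog.
    public static class MasterReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text upc, Iterable<Text> records, Context context)
                throws IOException, InterruptedException {
            StringBuilder catalogs = new StringBuilder();
            String chosenTitle = null;
            for (Text record : records) {
                String[] parts = record.toString().split("\t", 2);
                if (catalogs.length() > 0) catalogs.append(",");
                catalogs.append(parts[0]);              // accumulate catalog ids
                if (chosenTitle == null && parts.length > 1) {
                    chosenTitle = parts[1];             // naive "first wins" pick
                }
            }
            context.write(upc, new Text(chosenTitle + "\t" + catalogs));
        }
    }
}
```

The driver wiring (Job setup, input/output paths, setMapperClass/setReducerClass) is omitted for brevity, and I've used plain text I/O to keep the sketch self-contained; to read from and write to HBase tables directly, you'd extend TableMapper/TableReducer from HBase's map/reduce integration instead.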