Tags: algorithm, sorting, comparison, artificial-intelligence

Pair items from different lists


I am trying to pair objects from two different sites. They don't use universal IDs, so I need to do it all by hand.

One of the sites also has a weird way of identifying items internally, and I don't know how to couple them together. So far I have tried to use the structure that both sites seem to share:

  • Category (For indoor use, For outdoor, Paper, Metal, Pens, Food Use..)
  • Product ("Microsoft Widget A", "MS Widget B")
  • Item (10cm red, 200m blue made of paper..)

An item could be, for example, one size of a product (1.5cm, 10cm, 100cm) on one site (Site A), while the other site splits things differently. A product can be divided by quantity, color, or size, and each of those variants is an item. This means I can have:

  • Site A: Product "Widget A, red"; Items: 10cm, 20cm, 100cm
  • Site B: Product "Widget A"; Items: red 10cm, red 20cm, red 100cm, blue 10cm, blue 100cm

Another problem is that the categories are not defined the same way: Site A might say "Outdoor Widget for water", while the other has "Outdoor" with "Widget for water" as a subcategory. Or, even worse, it uses different wording.

To work around this, I have coupled the main categories of Site A with those of Site B by hand, and I treat all items in a subcategory as items of the parent category. This is suboptimal, since some categories on Site B might be organized differently (for example, there are subcategories like "related" that basically link to other main categories). For the products, I just compare all item names with the Levenshtein algorithm and pair each product with the one that has the highest ratio. I am using Site A to define how products are grouped, with the bad result that I don't group all "Widget A" together; instead I get separate pairs for "Widget A, red", "Widget A, blue", etc.
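To make the name-matching step concrete, here is a minimal sketch of it in plain Python. The normalization used in `ratio` (one minus distance over the longer length) is an assumption; Levenshtein libraries define their ratio in slightly different ways. The function names and the sample names are only for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def ratio(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means identical (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def best_matches(names_a, names_b):
    """Map each Site A name to the Site B name with the highest ratio."""
    return {
        name_a: max(names_b, key=lambda nb: ratio(name_a.lower(), nb.lower()))
        for name_a in names_a
    }


if __name__ == "__main__":
    site_a = ["Microsoft Widget A"]
    site_b = ["MS Widget B", "MS Widget A"]
    print(best_matches(site_a, site_b))
```

Because this compares whole strings character by character, names that carry the same words in a different order, or with extra attributes such as a color, score lower than they should, which matches the grouping problem described above.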

Does anyone have an idea how to improve this? Right now the only ideas I have are:

  • Create one big string from the product name and the main item features, and use Levenshtein again.
  • Use Site B as the main site to compare against.

I'm not an expert in AI. I've seen algorithms that help you define objects, but not one that pairs them. I will probably need to write something that classifies what is a feature of an item (red, 10cm). Maybe I could use the category name as well (for example, Site A: "Outdoor Widget for Water" and Site B: category "Outdoor", product "Widget for water").
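A rough sketch of that feature-classification idea, assuming features can be picked out with a small color vocabulary and a size pattern. `COLORS`, `SIZE_RE`, `TO_CM` and `extract_features` are all hypothetical names, and a real vocabulary would have to be built from the two sites' data.

```python
import re

# Hypothetical vocabulary; a real one would come from the sites' own data.
COLORS = {"red", "blue", "green", "black", "white"}
# A number followed by a length unit, e.g. "10cm", "1.5 cm", "200m".
SIZE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(mm|cm|m)\b")
TO_CM = {"mm": 0.1, "cm": 1.0, "m": 100.0}


def extract_features(text: str) -> dict:
    """Split a free-text label into color, size (in cm) and leftover words."""
    lowered = text.lower()
    size_cm = None
    match = SIZE_RE.search(lowered)
    if match:
        size_cm = float(match.group(1)) * TO_CM[match.group(2)]
        # Cut the size out so it does not pollute the leftover words.
        lowered = lowered[:match.start()] + " " + lowered[match.end():]
    words = re.findall(r"[a-z0-9]+", lowered)
    color = next((w for w in words if w in COLORS), None)
    rest = " ".join(w for w in words if w != color)
    return {"color": color, "size_cm": size_cm, "rest": rest}


if __name__ == "__main__":
    # Site A keeps the color in the product name, Site B in the item name.
    site_a = extract_features("Widget A, red" + " " + "10cm")
    site_b = extract_features("Widget A" + " " + "red 10cm")
    print(site_a)
    print(site_b)
    print(site_a == site_b)
```

Once product and item text from each site is reduced to the same kind of record, it no longer matters which site put the color in the product name and which put it in the item name; the leftover words can then be compared with whatever string similarity is in use.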


Solution

  • In case someone happens to read my question: I finally built something using cosine similarity.

    See, for example: "Use sklearn to find string similarity between two texts with large group of documents".
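For readers who want to see the idea, here is a dependency-free sketch of cosine similarity over word counts. It is not the code I built and not the sklearn approach from the linked answer: it uses raw counts rather than TF-IDF weights, and `tokens`, `cosine`, `pair_items` and the `threshold` value are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokens(text: str) -> Counter:
    """Bag of lowercase word tokens; word order is discarded."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two count vectors, in [0, 1]."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pair_items(items_a, items_b, threshold=0.5):
    """Pair each Site A label with its most similar Site B label.

    Labels whose best score is below `threshold` are mapped to None.
    """
    vectors_b = [(label, tokens(label)) for label in items_b]
    pairs = {}
    for label_a in items_a:
        vec_a = tokens(label_a)
        best_label, best_score = None, 0.0
        for label_b, vec_b in vectors_b:
            score = cosine(vec_a, vec_b)
            if score > best_score:
                best_label, best_score = label_b, score
        pairs[label_a] = best_label if best_score >= threshold else None
    return pairs


if __name__ == "__main__":
    site_a = ["Outdoor Widget for water"]
    site_b = ["Outdoor Paper", "outdoor widget for water red"]
    print(pair_items(site_a, site_b))
```

Unlike a character-level edit distance, this ignores word order, so "red 10cm Widget A" and "Widget A red 10cm" count as identical. That is why joining category, product and item text into one string per item works reasonably well with it.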