Search code examples
pythonmatchingfuzzy-search

Return best matches for the input query having several features in python


I have items as follows:

[
    { "id":"item1", "age": 1, "color": 'fff', "rate": 3 },
    { "id":"item2", "age": 2, "color": '000', "rate": 4 },
    { "id":"item3", "age": 3, "color": 'eee', "rate": 5 },
    { "id":"item4", "color": 'bbb', "rate": 5 }
]

Now, I expect user to search for a desirable item: {"age": 1, "color": '000', "rate":5} or even {"age": 3, "color": 'abc'}

I would like to find the best match(es) for this query. How do I do that? I'm not looking for an exact answer. Although, I'm interested to implement it as a backend service, so Python should be fine. I'm just not sure how to tackle the problem. Is there some sort of a matching algorithm or a fuzzy search I need to look at?

UPDATE: the data is big (millions of items), there are 50-100 keys for each items, but some items might not have them all. Also the user queries might not contain all keys.


Solution

  • How large is your dataset?

    For a small dataset, you could do this in O(n*m) time (n items in list, m keys in dict) by simply taking the item that has the most matches after comparing values at same keys, keeping track of number of matches.

    search_item = {"age": 1, "color": '000', "rate": 5}
    mx = -float('inf')
    for item in lst:
        curr = sum(search_item[k]==item[k] for k in item)
        if curr > mx:
            match = item
            mx = curr
    print(match)
    

    The search criterion might not be a simple key-value matching. You get to define that!

    For a very large dataset, you could instead construct a k-d tree from your list and cut down search time to O(log(n)) should in case you'll be reusing the list/tree to search for multiple items.

    You should convert the hex colors to numeric type so you have an homogenous int type across dimensions and comparison is easier, as comparing ints is way simpler than fuzzy string matching.

    For example, the color ffb is closer to fff than eee:

    >>> int('fff', 16)
    4095
    >>> int('ffb', 16)
    4091
    >>> int('eee', 16)
    3822