Search code examples
pythonmathoptimizationdata-structuresspatial-index

Need a proper data structure or index for fast user lookup based on 3d points and importance factors


I have a large number of 3d points paired with importance factors.

Each user has a six points. For example: Person Charlie has 6 points: (22,44,55) is his first point with an importance factor of 3, (10,0,0) is his second vector with a importance factor of 2.8 all the way down to his sixth point that is (100,300,200) with an importance factor of 0.4.

What I would like to do is to find the person that is most similar to Charlie without iterating through every single other person. Essentially minimizing this function for every user (ie, matching up the right six points from that user to Charlie):

pythagoras(point, point2) * max(importance_factor, importance_factor2) * (abs(importance_factor - importance_factor2) + 1)

And then finding the user that is most similar to Charlie over all by choosing the one with the lowest cost. I've written the code the dumb way for now (by doing lots of loops) but I'm looking for a way to properly handle the fact that there are multiple points AND importance factors.

I started looking into spacial indexes, but I don't think they will work since I have multiple points, but perhaps I could unroll the points into a higher dimensional point? So instead of 6 points in 3 dimensions I could have 1 point in 18 dimensions? Still can't handle the importance factor, but it would better than nothing.

Unfortunately I can't us vectors and cosines here, since (1,1,1) and (400,400,400) are very opposite things.

Any ideas?


Solution

  • Since you haven't gotten any answers yet, I thought I would at least contribute some thoughts. I have used a python k-d tree module for quickly searching nearest neighbor points:
    http://code.google.com/p/python-kdtree/downloads/detail?name=kdtree.py
    It takes arbitrary point lengths as long as they are the same sizes.

    I'm not sure how you would want to apply the weighting of the "importance", but here is just a brainstorm on how to use the kdtree module to at least get the closest "people" to each point of a given person's set:

    import numpy
    from kdtree import KDTree
    from itertools import chain
    
    class PersonPoint(object):
    
        def __init__(self, person, point, factor):
            self.person = person 
            self.point = point 
            self.factor = factor 
    
        def __repr__(self):
            return '<%s: %s, %0.2f>' % (self.person, 
                ['%0.2f' % p for p in self.point], self.factor) 
    
        def __iter__(self):
            return self.point
    
        def __len__(self):
            return len(self.point)
    
        def __getitem__(self, i):
            return self.point[i]
    
    
    people = {}
    for name in ('bill', 'john', 'mary', 'jenny', 'phil', 'george'):
        factors = numpy.random.rand(6)
        points = numpy.random.rand(6, 3).tolist()
        people[name] = [PersonPoint(name, p, f) for p,f in zip(points, factors)]
    
    bill_points = people['bill']
    others = list(chain(*[people[name] for name in people if name != 'bill']))
    
    tree = KDTree.construct_from_data(others)
    
    for point in bill_points:
        # t=1 means only return the 1 closest.
        # You could set it higher to return more.
        print point, "=>", tree.query(point, t=1)[0]
    

    Results:

    <bill: ['0.22', '0.64', '0.14'], 0.07> => 
        <phil: ['0.23', '0.54', '0.11'], 0.90>
    
    <bill: ['0.31', '0.87', '0.16'], 0.88> => 
        <phil: ['0.36', '0.80', '0.14'], 0.40>
    
    <bill: ['0.34', '0.64', '0.25'], 0.65> => 
        <jenny: ['0.29', '0.77', '0.28'], 0.40>
    
    <bill: ['0.24', '0.90', '0.23'], 0.53> => 
        <jenny: ['0.29', '0.77', '0.28'], 0.40>
    
    <bill: ['0.50', '0.69', '0.06'], 0.68> => 
        <phil: ['0.36', '0.80', '0.14'], 0.40>
    
    <bill: ['0.13', '0.67', '0.93'], 0.54> => 
        <jenny: ['0.05', '0.62', '0.94'], 0.84>
    

    I figured with the result, you could look at the most frequent matched "person" or then consider the weights. Or maybe you can total up the important factors in the results and then take the highest rated one. That way, if mary only matched once but had like a 10 factor, and phil had 3 matched but only totaled to 5, mary might be more relevant?

    I know you have a more robust function for creating an index but it would require going through every point in your collection.