Search code examples
pythonobjectfilterhierarchical

Python - creating a reference tree of objects by their attributes


I have a list of objects. These objects have nested attributes, which were generated in accordance with a method by hpaulj's response in this post: Object-like attribute access for nested dictionary.

I want to be able to find attributes, and values of attributes within these objects, and manipulate data they hold. However, in the real scenario, there may be upwards of one million objects, and the attributes may be deeply nested, which makes searching through a flat list a costly exercise when lots of manipulations need to occur.

For example, imagine the list of objects is as follows: list_of_objects = [object1, object2, object3, object4]

  • object1 has the following attributes: self.country = "Kenya", self.disease = "breast cancer"
  • object2 has the following attributes: self.country = "Kenya", self.disease = "diabetes"
  • object3 has the following attributes: self.country = 'Ireland', self.risk_factor.smoking = "Current"
  • object4 has the following attributes: self.country = 'Kenya', self.risk_factor.smoking = "Previous"

These objects were created from the following State class:

class State:
    def __init__(self, state_dictionary):
        self._state_dictionary = state_dictionary
        for key, value in self._state_dictionary.items():
            if isinstance(value, dict):
                value = State(value)
            setattr(self, key, value)

An example state_dictionary would be as follows, in the case of object3:

state_dictionary = {
    "country":"Ireland",
    "risk_factor":{
        "smoking":"Current"
    }
}

Importantly, nested attributes are State objects too.

I would like to affect all objects that possess an attribute, set of nested attributes, or possess an attribute with a specified value.

My thought, was to create a 'Controller', which would store that original list as separate lists within an object instance of the Controller class. Each of the original attributes and values would point to a list of the objects which contained those attributes or values, the basic design would be as follows:

class Controller:

    def __init__(self, list_of_objects):
        self.list_of_objects = list_of_objects  # Our list of objects from above
        self.create_hierarchy_of_objects()

    def create_hierarchy_of_objects(self):
        for o in self.list_of_objects:
            #  Does something here

After the create_hierarchy_of_objects method has run, I would be able to do the following operations:

  • Controller.country.Kenya would contain the list of all objects whose self.country is "Kenya", i.e object1, object2, object4
  • Controller.disease would contain the list of all objects who had the attribute self.disease i.e. object1 and object2
  • Controller.risk_factor.smoking.Current would contain the list of objects who had that set of attributes i.e. object3

The question is how would create_hierarchy_of_objects work? I few points of clarification

  • Nesting is arbitrarily long, and values may be the same e.g.
    • self.risk_factor.attribute1.attribute2 = "foo"
    • self.risk_factor.attribtue3.attribute4 = "foo" as well.
  • There may be a simpler way to do this, and I would welcome any suggestions.

Solution

  • If you have to deal with upwards of one million objects, generating an additional hierarchy is probably not the best solution. It would require many additional objects and waste a lot of time for just creating the hierarchy. The hierarchy would also need to be updated whenever list_of_objects change.

    Therefore, I suggest a more generic and dynamic approach by using iterators and a XPath like principle. Let's call it OPath. The OPath class is a lightweight object, which simply concatenates the attributes to a kind of attribute path. It also keeps a reference to the original list of entry objects. And finally, it is based on attributes only and so works for any type of object.

    The actual lookup happens, when we start iterating through the OPath object (e.g., putting the object into a list(), using a for-loop, ...). The OPath returns an iterator, which recursively looks up the matching objects according to the attribute path based on the actual content in the originally supplied list. It yields one matching object after another to avoid the creation of unnecessary lists with entirely populated matching objects.

    class OPath:
        def __init__(self, objects, path = []):
            self.__objects = objects
            self.__path = path
    
        def __getattr__(self, __name):
            return OPath(self.__objects, self.__path + [__name])
    
        def __iter__(self):
            yield from (__object for __object in self.__objects if self.__matches(__object, self.__path))
    
        @staticmethod
        def __matches(__object, path):
            if path:
                if hasattr(__object, path[0]):
                    return OPath.__matches(getattr(__object, path[0]), path[1:])
                if __object == path[0] and len(path) <= 1:
                    return True
                return False
            return True
    
    if __name__ == '__main__':
        class State:
            def __init__(self, state_dictionary):
                self._state_dictionary = state_dictionary
                for key, value in self._state_dictionary.items():
                    if isinstance(value, dict):
                        value = State(value)
                    setattr(self, key, value)
    
        o1 = State({ "country":"Kenya", "disease": "breast cancer" })
        o2 = State({ "country":"Kenya", "disease": "diabetes" })
        o3 = State({ "country":"Ireland", "risk_factor": { "smoking":"Current" } })
        o4 = State({ "country":"Kenya", "risk_factor": { "smoking":"Previous" } })
    
        # test cases with absolute paths
        print("Select absolute")
        path = OPath([o1, o2, o3, o4])
        print(list(path) == [o1, o2, o3, o4])
        print(list(path.country) == [o1, o2, o3, o4])
        print(list(path.country.Kenya) == [o1, o2, o4])
        print(list(path.disease) == [o1, o2])
        print(list(path.disease.diabetes) == [o2])
        print(list(path.risk_factor.smoking) == [o3, o4])
        print(list(path.risk_factor.smoking.Current) == [o3])
        print(list(path.doesnotexist.smoking.Current) == [])
        print(list(path.risk_factor.doesnotexist.Current) == [])
        print(list(path.risk_factor.smoking.invalidvalue) == [])
        print(list(path.risk_factor.doesnotexist.Current.invalidpath) == [])
    
        # test cases with relative paths
        country = OPath([o1, o2, o3, o4], ["country"])
        print("Select relative from country:")
        print(list(country) == [o1, o2, o3, o4])
        print(list(country.Kenya) == [o1, o2, o4])
    
        print("Select all with country=Kenya")
        kenya = OPath([o1, o2, o3, o4], ['country', 'Kenya'])
        print(list(kenya) == [o1, o2, o4])
    

    Output is expected to be True for all test cases.