Python - creating a reference tree of objects by their attributes

I have a list of objects. These objects have nested attributes, which were generated in accordance with a method by hpaulj's response in this post: Object-like attribute access for nested dictionary.

I want to be able to find attributes, and values of attributes within these objects, and manipulate data they hold. However, in the real scenario, there may be upwards of one million objects, and the attributes may be deeply nested, which makes searching through a flat list a costly exercise when lots of manipulations need to occur.

For example, imagine the list of objects is as follows: list_of_objects = [object1, object2, object3, object4]

object1 has the following attributes: self.country = "Kenya", self.disease = "breast cancer"
object2 has the following attributes: self.country = "Kenya", self.disease = "diabetes"
object3 has the following attributes: self.country = 'Ireland', self.risk_factor.smoking = "Current"
object4 has the following attributes: self.country = 'Kenya', self.risk_factor.smoking = "Previous"

These objects were created from the following State class:

class State:
    def __init__(self, state_dictionary):
        self._state_dictionary = state_dictionary
        for key, value in self._state_dictionary.items():
            if isinstance(value, dict):
                value = State(value)
            setattr(self, key, value)

An example state_dictionary would be as follows, in the case of object3:

state_dictionary = {
    "country":"Ireland",
    "risk_factor":{
        "smoking":"Current"
    }
}

Importantly, nested attributes are State objects too.

I would like to affect all objects that possess an attribute, set of nested attributes, or possess an attribute with a specified value.

My thought, was to create a 'Controller', which would store that original list as separate lists within an object instance of the Controller class. Each of the original attributes and values would point to a list of the objects which contained those attributes or values, the basic design would be as follows:

class Controller:

    def __init__(self, list_of_objects):
        self.list_of_objects = list_of_objects  # Our list of objects from above
        self.create_hierarchy_of_objects()

    def create_hierarchy_of_objects(self):
        for o in self.list_of_objects:
            #  Does something here

After the create_hierarchy_of_objects method has run, I would be able to do the following operations:

Controller.country.Kenya would contain the list of all objects whose self.country is "Kenya", i.e object1, object2, object4
Controller.disease would contain the list of all objects who had the attribute self.disease i.e. object1 and object2
Controller.risk_factor.smoking.Current would contain the list of objects who had that set of attributes i.e. object3

The question is how would create_hierarchy_of_objects work? I few points of clarification

Nesting is arbitrarily long, and values may be the same e.g.
- self.risk_factor.attribute1.attribute2 = "foo"
- self.risk_factor.attribtue3.attribute4 = "foo" as well.
There may be a simpler way to do this, and I would welcome any suggestions.

Solution

If you have to deal with upwards of one million objects, generating an additional hierarchy is probably not the best solution. It would require many additional objects and waste a lot of time for just creating the hierarchy. The hierarchy would also need to be updated whenever list_of_objects change.

Therefore, I suggest a more generic and dynamic approach by using iterators and a XPath like principle. Let's call it OPath. The OPath class is a lightweight object, which simply concatenates the attributes to a kind of attribute path. It also keeps a reference to the original list of entry objects. And finally, it is based on attributes only and so works for any type of object.

The actual lookup happens, when we start iterating through the OPath object (e.g., putting the object into a list(), using a for-loop, ...). The OPath returns an iterator, which recursively looks up the matching objects according to the attribute path based on the actual content in the originally supplied list. It yields one matching object after another to avoid the creation of unnecessary lists with entirely populated matching objects.

class OPath:
    def __init__(self, objects, path = []):
        self.__objects = objects
        self.__path = path

    def __getattr__(self, __name):
        return OPath(self.__objects, self.__path + [__name])

    def __iter__(self):
        yield from (__object for __object in self.__objects if self.__matches(__object, self.__path))

    @staticmethod
    def __matches(__object, path):
        if path:
            if hasattr(__object, path[0]):
                return OPath.__matches(getattr(__object, path[0]), path[1:])
            if __object == path[0] and len(path) <= 1:
                return True
            return False
        return True

if __name__ == '__main__':
    class State:
        def __init__(self, state_dictionary):
            self._state_dictionary = state_dictionary
            for key, value in self._state_dictionary.items():
                if isinstance(value, dict):
                    value = State(value)
                setattr(self, key, value)

    o1 = State({ "country":"Kenya", "disease": "breast cancer" })
    o2 = State({ "country":"Kenya", "disease": "diabetes" })
    o3 = State({ "country":"Ireland", "risk_factor": { "smoking":"Current" } })
    o4 = State({ "country":"Kenya", "risk_factor": { "smoking":"Previous" } })

    # test cases with absolute paths
    print("Select absolute")
    path = OPath([o1, o2, o3, o4])
    print(list(path) == [o1, o2, o3, o4])
    print(list(path.country) == [o1, o2, o3, o4])
    print(list(path.country.Kenya) == [o1, o2, o4])
    print(list(path.disease) == [o1, o2])
    print(list(path.disease.diabetes) == [o2])
    print(list(path.risk_factor.smoking) == [o3, o4])
    print(list(path.risk_factor.smoking.Current) == [o3])
    print(list(path.doesnotexist.smoking.Current) == [])
    print(list(path.risk_factor.doesnotexist.Current) == [])
    print(list(path.risk_factor.smoking.invalidvalue) == [])
    print(list(path.risk_factor.doesnotexist.Current.invalidpath) == [])

    # test cases with relative paths
    country = OPath([o1, o2, o3, o4], ["country"])
    print("Select relative from country:")
    print(list(country) == [o1, o2, o3, o4])
    print(list(country.Kenya) == [o1, o2, o4])

    print("Select all with country=Kenya")
    kenya = OPath([o1, o2, o3, o4], ['country', 'Kenya'])
    print(list(kenya) == [o1, o2, o4])

Output is expected to be True for all test cases.