Tags: python, large-data

Efficiently find objects satisfying relationship


Let's say I have some objects, like in this example (JSON code):

{
    "people" : {
        "Alice" : {
            "position" : "Manager",
            "company" : "Company1"
        },
        "Bob" : {
            "position" : "CEO",
            "company" : "Company1"
        },
        "Charlie" : {
            "position" : "CEO",
            "company" : "Company2"
        }
    },
    "companies" : [
        { "name" : "Company1" },
        { "name" : "Company2" }
    ]
}

I want to write a function get_X_of_Y(x, y) so that, for example, get_X_of_Y("CEO", companies[0]) returns Bob.

How could I efficiently do this for large datasets? I have the following function:

def get_X_of_Y(x, y):
    # Linear scan: return the first person holding position x at company y
    for person in people:
        if person.position == x and person.company == y.name:
            return person
    return None

Suppose I have thousands of people and hundreds of companies. Is there a faster way to do this than by looping through everyone? I can precompute the objects if there's a way to make things faster.


Solution

  • Suppose the data is:

    data = {
        "people" : {
            "Alice" : {
                "position" : "Manager",
                "company" : "Company1"
            },
            "Bob" : {
                "position" : "CEO",
                "company" : "Company1"
            },
            "Charlie" : {
                "position" : "CEO",
                "company" : "Company2"
            }
        },
        "companies" : [
            { "name" : "Company1" },
            { "name" : "Company2" }
        ]
    }
    

    Then you could create a list of people, which is a flat structure compared to your nested dict:

    >>> people = [(key, value["position"], value["company"]) for key, value in data["people"].items()]
    >>> people
    [('Alice', 'Manager', 'Company1'),
     ('Bob', 'CEO', 'Company1'),
     ('Charlie', 'CEO', 'Company2')]
    

    And also a list of company names, which likewise does away with the dict structure:

    >>> companies = [item['name'] for item in data["companies"]]
    >>> companies
    ['Company1', 'Company2']
    

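    Now querying is simple: use the built-in filter function.

    def get_X_of_Y(x, y):
        # Each item is (name, position, company); list() is needed on Python 3,
        # where filter returns a lazy iterator
        return list(filter(lambda item: item[1] == x and item[2] == y, people))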
    

    And so you can easily search now:

    >>> get_X_of_Y("CEO", companies[0])
    [('Bob', 'CEO', 'Company1')]
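
The filter approach above still scans every person on each call. Since the question allows precomputation, one alternative (not part of the original answer) is to build a dictionary keyed by (position, company) once, so that each lookup is a single hash access. This is a minimal sketch; the names build_index and INDEX are hypothetical, and it assumes y is passed as a company name string, as in the companies list above.

```python
from collections import defaultdict

# Same sample data as in the question, trimmed to the "people" part.
data = {
    "people": {
        "Alice": {"position": "Manager", "company": "Company1"},
        "Bob": {"position": "CEO", "company": "Company1"},
        "Charlie": {"position": "CEO", "company": "Company2"},
    }
}

def build_index(people):
    """Map (position, company) -> list of names. Built once in O(n)."""
    index = defaultdict(list)
    for name, info in people.items():
        index[(info["position"], info["company"])].append(name)
    return index

# Hypothetical module-level index, precomputed once up front.
INDEX = build_index(data["people"])

def get_X_of_Y(x, y):
    """Return the names holding position x at company y (possibly empty)."""
    # .get avoids inserting empty lists into the defaultdict on a miss.
    return INDEX.get((x, y), [])

print(get_X_of_Y("CEO", "Company1"))      # ['Bob']
print(get_X_of_Y("Manager", "Company2"))  # []
```

The trade-off is that the index has to be rebuilt or updated whenever a person changes position or company.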
    

    However, I would still suggest using a database if you really have thousands of people and hundreds of companies.
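
To illustrate the database suggestion, here is a minimal sketch using Python's standard-library sqlite3 module with an in-memory database. The table name, column names, and index name are assumptions chosen for this example, not something given in the question.

```python
import sqlite3

# In-memory database for illustration; pass a file path to persist it.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (name TEXT PRIMARY KEY, position TEXT, company TEXT)"
)
# Composite index so lookups by (position, company) avoid a full table scan.
conn.execute("CREATE INDEX idx_position_company ON people (position, company)")

rows = [
    ("Alice", "Manager", "Company1"),
    ("Bob", "CEO", "Company1"),
    ("Charlie", "CEO", "Company2"),
]
conn.executemany("INSERT INTO people VALUES (?, ?, ?)", rows)
conn.commit()

def get_X_of_Y(x, y):
    """Return the names of people with position x at company y."""
    cur = conn.execute(
        "SELECT name FROM people WHERE position = ? AND company = ? ORDER BY name",
        (x, y),
    )
    return [name for (name,) in cur]

print(get_X_of_Y("CEO", "Company1"))  # ['Bob']
```

Parameterized queries (the ? placeholders) keep the lookup safe if x and y come from user input.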