python pandas dataframe numpy relational-database

Last member of each element of an id list of indices in relational dataset


Suppose I have two datasets in Python: households and people (individuals). A key or id (int64) connects each household with one or more individuals. I want to create a binary variable called "last_member" that takes the value 0 if more individuals of the same household follow in the dataset, and 1 if this individual is the last member of the household.

A trivial example would be the following:

last_member id ...
0           1  ...
0           1  ...
1           1  ...
1           2  ...
0           3  ...
1           3  ...
...

I can get the number of unique ids from either the households dataset or the individuals dataset itself.

I get a feeling that either numpy's where function, or pandas' aggregate are strong candidates to find such a solution. Still, I can't wrap my head around an efficient solution that does not involve, let's say, looping over the list of indices.


Solution

  • I wrote a function that solves the problem efficiently. The idea is to initialise the variable "last_member" to zeros and then count the rows per id using pandas' groupby. The cumulative sum of those counts (minus 1, because Python indexing is zero-based) gives the row position of the last member of each household, and those rows are set to 1. This assumes that all rows of a household are stored contiguously.

    import numpy as np

    def create_last_member_variable(data):
        """Create a last_member variable marking the last row of each id.

        Assumes rows with the same id are contiguous.
        """
        data["last_member"] = 0
        # Rows per id, in order of first appearance (sort=False).
        n_members = data.groupby("id", sort=False).size()
        # Position of each id's last row: cumulative count minus 1 (0-based).
        row_pos = np.cumsum(n_members.to_numpy()) - 1
        # Positional assignment, so a non-default index does not matter.
        data.iloc[row_pos, data.columns.get_loc("last_member")] = 1

        return data
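
As an alternative sketch (not the method used in the function above), the same flag can probably be built without any cumulative sum by asking pandas which rows are the last occurrence of their id. The DataFrame name `people` and its contents are made up here to mirror the example table in the question; only the `id` column name comes from the original post.

```python
import pandas as pd

# Hypothetical data mirroring the example table in the question.
people = pd.DataFrame({"id": [1, 1, 1, 2, 3, 3]})

# duplicated(keep="last") is False only for the last occurrence of each id,
# so negating it flags the last member of every household.
people["last_member"] = (~people["id"].duplicated(keep="last")).astype(int)

print(people["last_member"].tolist())  # [0, 0, 1, 1, 0, 1]
```

Because this works on values rather than row positions, it should not depend on the rows being sorted by id or on the DataFrame having a default integer index.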