Search code examples
pythonpandasdataframehierarchy

Creating numeric identifiers for hierarchical DataFrame columns


I have a Pandas DataFrame containing 10+ columns of data and several million rows.

Three columns form a hierarchy with three different levels: high, medium and low. These three columns contain strings with no missing data. Each column is ordered lexicographically within the overall combined hierarchy, so e.g. ["A…","B…","C…"] comes before ["H…","A…","B…"]

I would like to add three new integer columns: high_id, medium_id, low_id. Each of these three X_id columns should have a value for every DataFrame row. The X_id columns are initially set to 1 for the first row. X_id columns are incremented when the corresponding X value for a column differs from the previous row, unless the values for higher levels change which resets the X_id to 1 instead

Example pure Python implementation:

rows = [
    ["high1", "med1", "low1"],
    ["high1", "med1", "low1"],
    ["high1", "med1", "low2"],
    ["high1", "med1", "low3"],
    ["high1", "med1", "low3"],
    ["high1", "med1", "low3"],
    ["high1", "med1", "low4"],
    ["high1", "med2", "low5"],
    ["high1", "med2", "low6"],
    ["high1", "med3", "low7"],
    ["high1", "med3", "low7"],
    ["high1", "med3", "low7"],
    ["high1", "med4", "low8"],
    ["high2", "med5", "low9"],
    ["high2", "med5", "lowA"],
    ["high2", "med5", "lowA"],
    ["high2", "med6", "lowB"],
    ["high3", "med4", "lowC"],
    ["high3", "med7", "low1"],
    ["high3", "med7", "lowD"],
    ["high3", "med7", "lowE"]]

high_id, medium_id, low_id = 1, 1, 1
ids = [[high_id, medium_id, low_id]]
previous_row = rows[0]

for row in rows[1:]:
    # Compare "high"
    if previous_row[0] != row[0]:
        high_id += 1
        medium_id = 1
        low_id = 1
    # Compare "medium"
    elif previous_row[1] != row[1]:
        medium_id += 1
        low_id = 1
    # Compare "low"
    elif previous_row[2] != row[2]:
        low_id += 1
    ids.append([high_id, medium_id, low_id])
    previous_row = row

for i, v in enumerate(rows):
    print(v + ids[i])

Output:

# high, medium, low, high_id, medium_id, low_id
['high1', 'med1', 'low1', 1, 1, 1]
['high1', 'med1', 'low1', 1, 1, 1]
['high1', 'med1', 'low2', 1, 1, 2]
['high1', 'med1', 'low3', 1, 1, 3]
['high1', 'med1', 'low3', 1, 1, 3]
['high1', 'med1', 'low3', 1, 1, 3]
['high1', 'med1', 'low4', 1, 1, 4]
['high1', 'med2', 'low5', 1, 2, 1] # medium changed; low_id reset
['high1', 'med2', 'low6', 1, 2, 2]
['high1', 'med3', 'low7', 1, 3, 1] # medium changed; low_id reset
['high1', 'med3', 'low7', 1, 3, 1]
['high1', 'med3', 'low7', 1, 3, 1]
['high1', 'med4', 'low8', 1, 4, 1] # medium changed; low_id reset
['high2', 'med5', 'low9', 2, 1, 1] # high changed; low_id, medium_id reset
['high2', 'med5', 'lowA', 2, 1, 2]
['high2', 'med5', 'lowA', 2, 1, 2]
['high2', 'med6', 'lowB', 2, 2, 1] # medium changed; low_id reset
['high3', 'med4', 'lowC', 3, 1, 1] # high changed; low_id, medium_id reset
['high3', 'med7', 'low1', 3, 2, 1] # medium changed; low_id reset
['high3', 'med7', 'lowD', 3, 2, 2]
['high3', 'med7', 'lowE', 3, 2, 3]

Note that the columns actually consist of geographical place names: consequently, values for medium and low could, in principle, reappear for a different sequence of parent levels. (There are very few "high" values and I can see that none of them are duplicated.)

What's the idiomatic Pandas way of adding these columns, preferably via vectorised operations?

I've read many existing questions for topics like "hierarchy", "counter", "identifier" and couldn't find anything to match this specific nested case where a "resetting" identifier is required.


Solution

  • I don't know if this is a customary method, but we asked for the information needed to group them together in order to determine their respective IDs. The logic is to combine them together, and the index that matches the list is the ID information. However, I couldn't find a way to avoid loop processing, so I used loop processing. This may not be to your satisfaction, but I'll answer it as one approach.

    import pandas as pd
    import numpy as np
    import io
    
    rows = [
        ["high1", "med1", "low1"],
        ["high1", "med1", "low1"],
        ["high1", "med1", "low2"],
        ["high1", "med1", "low3"],
        ["high1", "med1", "low3"],
        ["high1", "med1", "low3"],
        ["high1", "med1", "low4"],
        ["high1", "med2", "low5"],
        ["high1", "med2", "low6"],
        ["high1", "med3", "low7"],
        ["high1", "med3", "low7"],
        ["high1", "med3", "low7"],
        ["high1", "med4", "low8"],
        ["high2", "med5", "low9"],
        ["high2", "med5", "lowA"],
        ["high2", "med5", "lowA"],
        ["high2", "med6", "lowB"],
        ["high3", "med4", "lowC"],
        ["high3", "med7", "low1"],
        ["high3", "med7", "lowD"],
        ["high3", "med7", "lowE"]]
    
    df = pd.DataFrame(rows, columns=['high','medium','low'])
    df['high_id'] = df['high'].str.extract(r'(\d)')
    m = df.groupby('high')['medium'].unique().to_frame().reset_index()
    l = df.groupby(['high','medium'])['low'].unique().to_frame().reset_index()
    df = df.merge(m, on='high', how='outer')
    df.rename(columns={'medium_x':'medium'}, inplace=True)
    df = df.merge(l, on=['high','medium'], how='outer')
    
    df.tail()
        high    medium  low_x   high_id medium_y    low_y
    16  high2   med6    lowB    2   [med5, med6]    [lowB]
    17  high3   med4    lowC    3   [med4, med7]    [lowC]
    18  high3   med7    low1    3   [med4, med7]    [low1, lowD, lowE]
    19  high3   med7    lowD    3   [med4, med7]    [low1, lowD, lowE]
    20  high3   med7    lowE    3   [med4, med7]    [low1, lowD, lowE]
    
    df['medium_id'] = ''
    for i in range(len(df)):
        con = np.where(df.loc[i,'medium'] == df.loc[i,'medium_y'])
        df.loc[i,'medium_id'] = int(con[0]) + 1
    
    df['low_id'] = ''
    for i in range(len(df)):
        con = np.where(df.loc[i,'low_x'] == df.loc[i,'low_y'])
        df.loc[i,'low_id'] = int(con[0]) + 1
    
    df = df[['high', 'medium', 'low_x', 'high_id', 'medium_id','low_id']]
    df.columns = ['high', 'medium', 'low', 'high_id', 'medium_id','low_id']
    df
        high    medium  low high_id medium_id   low_id
    0   high1   med1    low1    1   1   1
    1   high1   med1    low1    1   1   1
    2   high1   med1    low2    1   1   2
    3   high1   med1    low3    1   1   3
    4   high1   med1    low3    1   1   3
    5   high1   med1    low3    1   1   3
    6   high1   med1    low4    1   1   4
    7   high1   med2    low5    1   2   1
    8   high1   med2    low6    1   2   2
    9   high1   med3    low7    1   3   1
    10  high1   med3    low7    1   3   1
    11  high1   med3    low7    1   3   1
    12  high1   med4    low8    1   4   1
    13  high2   med5    low9    2   1   1
    14  high2   med5    lowA    2   1   2
    15  high2   med5    lowA    2   1   2
    16  high2   med6    lowB    2   2   1
    17  high3   med4    lowC    3   1   1
    18  high3   med7    low1    3   2   1
    19  high3   med7    lowD    3   2   2
    20  high3   med7    lowE    3   2   3