I have a Pandas DataFrame containing 10+ columns of data and several million rows.
Three columns form a hierarchy with three different levels: high
, medium
and low
. These three columns contain strings with no missing data. Each column is ordered lexicographically within the overall combined hierarchy, so e.g. ["A…","B…","C…"]
comes before ["H…","A…","B…"]
I would like to add three new integer columns: high_id
, medium_id
, low_id
. Each of these three X_id
columns should have a value for every DataFrame row. The X_id
columns are initially set to 1 for the first row. X_id
columns are incremented when the corresponding X
value for a column differs from the previous row, unless the values for higher levels change which resets the X_id
to 1 instead
Example pure Python implementation:
rows = [
["high1", "med1", "low1"],
["high1", "med1", "low1"],
["high1", "med1", "low2"],
["high1", "med1", "low3"],
["high1", "med1", "low3"],
["high1", "med1", "low3"],
["high1", "med1", "low4"],
["high1", "med2", "low5"],
["high1", "med2", "low6"],
["high1", "med3", "low7"],
["high1", "med3", "low7"],
["high1", "med3", "low7"],
["high1", "med4", "low8"],
["high2", "med5", "low9"],
["high2", "med5", "lowA"],
["high2", "med5", "lowA"],
["high2", "med6", "lowB"],
["high3", "med4", "lowC"],
["high3", "med7", "low1"],
["high3", "med7", "lowD"],
["high3", "med7", "lowE"]]
high_id, medium_id, low_id = 1, 1, 1
ids = [[high_id, medium_id, low_id]]
previous_row = rows[0]
for row in rows[1:]:
# Compare "high"
if previous_row[0] != row[0]:
high_id += 1
medium_id = 1
low_id = 1
# Compare "medium"
elif previous_row[1] != row[1]:
medium_id += 1
low_id = 1
# Compare "low"
elif previous_row[2] != row[2]:
low_id += 1
ids.append([high_id, medium_id, low_id])
previous_row = row
for i, v in enumerate(rows):
print(v + ids[i])
Output:
# high, medium, low, high_id, medium_id, low_id
['high1', 'med1', 'low1', 1, 1, 1]
['high1', 'med1', 'low1', 1, 1, 1]
['high1', 'med1', 'low2', 1, 1, 2]
['high1', 'med1', 'low3', 1, 1, 3]
['high1', 'med1', 'low3', 1, 1, 3]
['high1', 'med1', 'low3', 1, 1, 3]
['high1', 'med1', 'low4', 1, 1, 4]
['high1', 'med2', 'low5', 1, 2, 1] # medium changed; low_id reset
['high1', 'med2', 'low6', 1, 2, 2]
['high1', 'med3', 'low7', 1, 3, 1] # medium changed; low_id reset
['high1', 'med3', 'low7', 1, 3, 1]
['high1', 'med3', 'low7', 1, 3, 1]
['high1', 'med4', 'low8', 1, 4, 1] # medium changed; low_id reset
['high2', 'med5', 'low9', 2, 1, 1] # high changed; low_id, medium_id reset
['high2', 'med5', 'lowA', 2, 1, 2]
['high2', 'med5', 'lowA', 2, 1, 2]
['high2', 'med6', 'lowB', 2, 2, 1] # medium changed; low_id reset
['high3', 'med4', 'lowC', 3, 1, 1] # high changed; low_id, medium_id reset
['high3', 'med7', 'low1', 3, 2, 1] # medium changed; low_id reset
['high3', 'med7', 'lowD', 3, 2, 2]
['high3', 'med7', 'lowE', 3, 2, 3]
Note that the columns actually consist of geographical place names: consequently, values for medium
and low
could, in principle, reappear for a different sequence of parent levels. (There are very few "high" values and I can see that none of them are duplicated.)
What's the idiomatic Pandas way of adding these columns, preferably via vectorised operations?
I've read many existing questions for topics like "hierarchy", "counter", "identifier" and couldn't find anything to match this specific nested case where a "resetting" identifier is required.
I don't know if this is a customary method, but we asked for the information needed to group them together in order to determine their respective IDs. The logic is to combine them together, and the index that matches the list is the ID information. However, I couldn't find a way to avoid loop processing, so I used loop processing. This may not be to your satisfaction, but I'll answer it as one approach.
import pandas as pd
import numpy as np
import io
rows = [
["high1", "med1", "low1"],
["high1", "med1", "low1"],
["high1", "med1", "low2"],
["high1", "med1", "low3"],
["high1", "med1", "low3"],
["high1", "med1", "low3"],
["high1", "med1", "low4"],
["high1", "med2", "low5"],
["high1", "med2", "low6"],
["high1", "med3", "low7"],
["high1", "med3", "low7"],
["high1", "med3", "low7"],
["high1", "med4", "low8"],
["high2", "med5", "low9"],
["high2", "med5", "lowA"],
["high2", "med5", "lowA"],
["high2", "med6", "lowB"],
["high3", "med4", "lowC"],
["high3", "med7", "low1"],
["high3", "med7", "lowD"],
["high3", "med7", "lowE"]]
df = pd.DataFrame(rows, columns=['high','medium','low'])
df['high_id'] = df['high'].str.extract(r'(\d)')
m = df.groupby('high')['medium'].unique().to_frame().reset_index()
l = df.groupby(['high','medium'])['low'].unique().to_frame().reset_index()
df = df.merge(m, on='high', how='outer')
df.rename(columns={'medium_x':'medium'}, inplace=True)
df = df.merge(l, on=['high','medium'], how='outer')
df.tail()
high medium low_x high_id medium_y low_y
16 high2 med6 lowB 2 [med5, med6] [lowB]
17 high3 med4 lowC 3 [med4, med7] [lowC]
18 high3 med7 low1 3 [med4, med7] [low1, lowD, lowE]
19 high3 med7 lowD 3 [med4, med7] [low1, lowD, lowE]
20 high3 med7 lowE 3 [med4, med7] [low1, lowD, lowE]
df['medium_id'] = ''
for i in range(len(df)):
con = np.where(df.loc[i,'medium'] == df.loc[i,'medium_y'])
df.loc[i,'medium_id'] = int(con[0]) + 1
df['low_id'] = ''
for i in range(len(df)):
con = np.where(df.loc[i,'low_x'] == df.loc[i,'low_y'])
df.loc[i,'low_id'] = int(con[0]) + 1
df = df[['high', 'medium', 'low_x', 'high_id', 'medium_id','low_id']]
df.columns = ['high', 'medium', 'low', 'high_id', 'medium_id','low_id']
df
high medium low high_id medium_id low_id
0 high1 med1 low1 1 1 1
1 high1 med1 low1 1 1 1
2 high1 med1 low2 1 1 2
3 high1 med1 low3 1 1 3
4 high1 med1 low3 1 1 3
5 high1 med1 low3 1 1 3
6 high1 med1 low4 1 1 4
7 high1 med2 low5 1 2 1
8 high1 med2 low6 1 2 2
9 high1 med3 low7 1 3 1
10 high1 med3 low7 1 3 1
11 high1 med3 low7 1 3 1
12 high1 med4 low8 1 4 1
13 high2 med5 low9 2 1 1
14 high2 med5 lowA 2 1 2
15 high2 med5 lowA 2 1 2
16 high2 med6 lowB 2 2 1
17 high3 med4 lowC 3 1 1
18 high3 med7 low1 3 2 1
19 high3 med7 lowD 3 2 2
20 high3 med7 lowE 3 2 3