I'm trying to find a way to create a unique identifier by using 3 columns (user_id, universal_id and session_id). Column "expected_result" is what this unique_id should be after processing other 3 columns.
Here's a list of possible relationships between columns:
I'm trying to find a thing in python (or pyspark because I may be using this on millions of rows) that can help me do the clustering of this data (or however this process is called in data science). The idea is to create a map of universal_id:unique_id. If you know how this is done please help, or at least point me to a subject that I should research to be able to do this. Thanks!
I have Snowflake and Databricks at my disposal.
Here's my test dataset:
import pandas as pd
data = [
[1, 1, 'apple', 'fiat', 1],
[2, 1, 'pear', 'bmw', 1],
[3, 2, 'bananna', 'citroen', 2],
[4, 3, 'bananna', 'kia', 3],
[5, 4, 'blueberry', 'peugeot', 4],
[6, None, 'blueberry', 'peugeot', 4],
[7, None, 'blueberry', 'yamaha', 4],
[8, 5, 'plum', 'ford', 5],
[9, None, 'watermelon', 'ford', 5],
[10, None, 'raspberry', 'honda', 6],
[11, None, 'raspberry', 'toyota', 6],
[12, None, 'avocado', 'mercedes', 7],
[13, None, 'cherry', 'mercedes', 7],
[14, None, 'apricot', 'volkswagen', 2],
[15, 2, 'apricot', 'volkswagen', 2],
[16, 6, 'blueberry', 'audi', 8],
[17, None, 'blackberry', 'bmw', 1],
[18, 7, 'plum', 'porsche', 9]
]
df = pd.DataFrame(data, columns=['id', 'user_id', 'universal_id', 'session_id', 'expected_result'])
Based on what you described, we can formulate an algorithm as follows, referring to the new ID as global_id
. Update: The algorithm now features the arbitrary tie-break when multiple user_id
s match multiple universal_id
s. Update: Since you were concerned about the risk of duplicates using fully randomly generated UUID4s, I coded you a little function which allows you to generate a UUID leveraging both UUID1 and/or UUID4 - I personally would not be worried about clashes of UUID4 values whatsoever, but it's up to you.
global_id
for every user_id
that has
multiple occurrences (n>1)global_id
to all rows with matching
user_id
global_id
for every user_id
that has single occurrence (n=1)global_id
to all rows with matching universal_id
, i.e. rows which don't match on user_id
but match on universal_id
. There is an arbitrary tie break if multiple universal_id
s match on one or more user_id
s, where all matching universal_id
s are assigned to the same user_id
global_id
for every universal_id
which
cannot be linked to a user_id
but has multiple occurrences
(n>1)global_id
to all rows with
matching universal_id
global_id
to all rows with matching session_id
, i.e. rows with don't match on neither user_id
or universal_id
but on session_id
global_id
for every session_id
which cannot
be linked to neither a user_id
nor universal_id
but has
multiple occurrences (n>1)global_id
to all rows with matching session_id
global_id
for every session_id which does not have
multiple occurrences (n=1)Hope this helps!
import uuid
import pandas as pd
import numpy as np
data = [
[1, 1, 'apple', 'fiat', 1],
[2, 1, 'pear', 'bmw', 1],
[3, 2, 'bananna', 'citroen', 2],
[4, 3, 'bananna', 'kia', 3],
[5, 4, 'blueberry', 'peugeot', 4],
[6, None, 'blueberry', 'peugeot', 4],
[7, None, 'blueberry', 'yamaha', 4],
[8, 5, 'plum', 'ford', 5],
[9, None, 'watermelon', 'ford', 5],
[10, None, 'raspberry', 'honda', 6],
[11, None, 'raspberry', 'toyota', 6],
[12, None, 'avocado', 'mercedes', 7],
[13, None, 'cherry', 'mercedes', 7],
[14, None, 'apricot', 'volkswagen', 2],
[15, 2, 'apricot', 'volkswagen', 2],
[16, 6, 'blueberry', 'audi', 8],
[17, None, 'blackberry', 'bmw', 1],
[18, 7, 'plum', 'porsche', 9]
]
def generate_uuid(use_uuid1: bool=True, use_uuid4: bool=False) -> str:
"""Helper function creating UUIDs.
Arguments:
use_uuid1: Whether generated UUID string should feature a UUID1 part.
Defaults to `True`.
use_uuid1: Whether generated UUID string should feature a UUID4 part.
Defaults to `False`.
Returns:
Universally unique identifier based on UUID1 and/or UUID4.
"""
uuid_str = ""
if not use_uuid1 and not use_uuid4:
raise ValueError("Both use_uuid1 and use_uuid4 are set to `False`, cannot create UUID.")
elif use_uuid1 and use_uuid4:
uuid_str += f"{str(uuid.uuid1())}-{str(uuid.uuid4())}"
elif use_uuid1:
uuid_str += str(uuid.uuid1())
else:
uuid_str += str(uuid.uuid4())
return uuid_str
df = pd.DataFrame(data, columns=['id', 'user_id', 'universal_id', 'session_id', 'expected_result'])
# STEP 1
df.sort_values(by='user_id', inplace=True)
df['_same_user_id'] = (
(df['user_id'] == df['user_id'].shift(-1))
& (df['user_id'] != df['user_id'].shift(1))
)
df['global_id'] = [generate_uuid() if value else np.NaN for value in df['_same_user_id'].values]
# STEP 2
df['global_id'] = df.groupby('user_id')['global_id'].ffill()
# STEP 3
df['_new_ids'] = [generate_uuid() if not np.isnan(value) else np.NaN for value in df['user_id'].values]
df['global_id'].fillna(df['_new_ids'], inplace=True)
# # STEP 4
df.sort_values(by='universal_id', inplace=True)
df['global_id'] = df.groupby('universal_id')['global_id'].ffill()
# STEP 5
df['_count_universal_id'] = df['universal_id'].groupby(df['universal_id']).transform('count')
df['_same_universal_id'] = (
(df['universal_id'] == df['universal_id'].shift(-1))
& (df['universal_id'] != df['universal_id'].shift(1))
)
df['_new_id_for_universal_id'] = (
df['_count_universal_id'].gt(1)
& (df['global_id'].isnull())
& df['_same_universal_id']
)
df['_new_ids'] = [generate_uuid() if value else np.NaN for value in df['_new_id_for_universal_id'].values]
df['global_id'].fillna(df['_new_ids'], inplace=True)
# STEP 6
df['global_id'] = df.groupby('universal_id')['global_id'].ffill()
# STEP 7
df.sort_values(by='session_id', inplace=True)
df['global_id'] = df.groupby('session_id')['global_id'].ffill()
# STEP 8
df['_count_session_id'] = df['session_id'].groupby(df['session_id']).transform('count')
df['_same_session_id'] = (
(df['session_id'] == df['session_id'].shift(-1))
& (df['session_id'] != df['session_id'].shift(1))
)
df['_new_id_for_session_id'] = (
df['_count_session_id'].gt(1)
& (df['global_id'].isnull())
& df['_same_session_id']
)
df['_new_ids'] = [generate_uuid() if value else np.NaN for value in df['_new_id_for_session_id'].values]
df['global_id'].fillna(df['_new_ids'], inplace=True)
# STEP 9
df['global_id'] = df.groupby('session_id')['global_id'].ffill()
# STEP 10
df['_new_ids'] = [generate_uuid() if value == 1 else np.NaN for value in df['_count_session_id'].values]
df['global_id'].fillna(df['_new_ids'], inplace=True)
# DROP INTERNAL COLUMNS
cols_to_drop = [col for col in df.columns if col.startswith("_")]
df.drop(columns=cols_to_drop, inplace=True)
Results (since we're now using UUID1, the ID's look very similar but they are not the same).
| id | user_id | universal_id | session_id | expected_result | global_id |
|-----:|----------:|:---------------|:-------------|------------------:|:-------------------------------------|
| 2 | 1 | pear | bmw | 1 | ee52b80a-f0da-11ed-8f35-0242ac1c000c |
| 17 | nan | blackberry | bmw | 1 | ee52b80a-f0da-11ed-8f35-0242ac1c000c |
| 1 | 1 | apple | fiat | 1 | ee52b80a-f0da-11ed-8f35-0242ac1c000c |
| 3 | 2 | bananna | citroen | 2 | ee52ba1c-f0da-11ed-8f35-0242ac1c000c |
| 14 | nan | apricot | volkswagen | 2 | ee52ba1c-f0da-11ed-8f35-0242ac1c000c |
| 15 | 2 | apricot | volkswagen | 2 | ee52ba1c-f0da-11ed-8f35-0242ac1c000c |
| 4 | 3 | bananna | kia | 3 | ee530d00-f0da-11ed-8f35-0242ac1c000c |
| 5 | 4 | blueberry | peugeot | 4 | ee530e04-f0da-11ed-8f35-0242ac1c000c |
| 6 | nan | blueberry | peugeot | 4 | ee530e04-f0da-11ed-8f35-0242ac1c000c |
| 7 | nan | blueberry | yamaha | 4 | ee530e04-f0da-11ed-8f35-0242ac1c000c |
| 9 | nan | watermelon | ford | 5 | ee530f08-f0da-11ed-8f35-0242ac1c000c |
| 8 | 5 | plum | ford | 5 | ee530f08-f0da-11ed-8f35-0242ac1c000c |
| 16 | 6 | blueberry | audi | 8 | ee53100c-f0da-11ed-8f35-0242ac1c000c |
| 18 | 7 | plum | porsche | 9 | ee531106-f0da-11ed-8f35-0242ac1c000c |
| 10 | nan | raspberry | honda | 6 | ee54a002-f0da-11ed-8f35-0242ac1c000c |
| 11 | nan | raspberry | toyota | 6 | ee54a002-f0da-11ed-8f35-0242ac1c000c |
| 13 | nan | cherry | mercedes | 7 | ee567260-f0da-11ed-8f35-0242ac1c000c |
| 12 | nan | avocado | mercedes | 7 | ee567260-f0da-11ed-8f35-0242ac1c000c |