python, apache-spark, pyspark

Entity resolution - creating a unique identifier based on 3 columns


I'm trying to find a way to create a unique identifier by using 3 columns (user_id, universal_id and session_id). The "expected_result" column shows what this unique id should be after processing the other 3 columns.

  • Sometimes user_id is not available, and in that case the other two columns should be used to create the unique id.
  • When two rows don't match on user_id but do match on universal_id, they should be treated as different (separate unique ids).
  • The "id" column is the order in which data is written into the database. If a new row shows up that matches any of the previous rows (with an already calculated unique id) on any of the 3 columns, the existing unique id should be assigned to the new row.

Here's a list of possible relationships between columns:

  • user_id:universal_id = 1:N OR N:1 (if N:1 then each N needs a unique_id)
  • user_id:session_id = 1:N
  • universal_id:session_id = 1:N or N:1

I'm trying to find something in Python (or PySpark, because I may be using this on millions of rows) that can help me do this kind of clustering of the data (or whatever this process is called in data science). The idea is to create a map of universal_id:unique_id. If you know how this is done please help, or at least point me to a subject that I should research to be able to do this. Thanks!

I have Snowflake and Databricks at my disposal.

Here's my test dataset:

import pandas as pd

data = [
    [1, 1, 'apple', 'fiat', 1],
    [2, 1, 'pear', 'bmw', 1],
    [3, 2, 'bananna', 'citroen', 2],
    [4, 3, 'bananna', 'kia', 3],
    [5, 4, 'blueberry', 'peugeot', 4],
    [6, None, 'blueberry', 'peugeot', 4],
    [7, None, 'blueberry', 'yamaha', 4],
    [8, 5, 'plum', 'ford', 5],
    [9, None, 'watermelon', 'ford', 5],
    [10, None, 'raspberry', 'honda', 6],
    [11, None, 'raspberry', 'toyota', 6],
    [12, None, 'avocado', 'mercedes', 7],
    [13, None, 'cherry', 'mercedes', 7],
    [14, None, 'apricot', 'volkswagen', 2],
    [15, 2, 'apricot', 'volkswagen', 2],
    [16, 6, 'blueberry', 'audi', 8],
    [17, None, 'blackberry', 'bmw', 1],
    [18, 7, 'plum', 'porsche', 9]
]

df = pd.DataFrame(data, columns=['id', 'user_id', 'universal_id', 'session_id', 'expected_result'])
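
To make the "clustering" idea concrete, a naive sketch would be to connect rows that share any of the three values and take connected components (assuming networkx is available). This is only an illustration - it over-merges (e.g. the two bananna rows land in one component even though their expected_result values differ) because it ignores the user_id rules above:

import networkx as nx
import pandas as pd

# Naive illustration: rows sharing any of the three identifiers end up in one component
G = nx.Graph()
for _, row in df.iterrows():
    row_node = ('row', int(row['id']))
    G.add_node(row_node)
    for col in ['user_id', 'universal_id', 'session_id']:
        if pd.notna(row[col]):
            # link the row to a node representing that identifier value
            G.add_edge(row_node, (col, row[col]))

# Each connected component is one candidate group of row ids
for cluster, component in enumerate(nx.connected_components(G), start=1):
    print(cluster, sorted(n[1] for n in component if n[0] == 'row'))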

Solution

  • Based on what you described, we can formulate an algorithm as follows, referring to the new ID as global_id. Update: the algorithm now features an arbitrary tie-break when multiple user_ids match multiple universal_ids. Update: since you were concerned about the risk of duplicates from fully randomly generated UUID4s, I added a small helper function that lets you generate a UUID based on UUID1 and/or UUID4 - personally I would not be worried about clashes of UUID4 values at all, but it's up to you.

    1. Create a new global_id for every user_id that has multiple occurrences (n>1)
    2. Propagate values for global_id to all rows with matching user_id
    3. Create a new global_id for every user_id that has single occurrence (n=1)
    4. Propagate values for global_id to all rows with a matching universal_id, i.e. rows that don't match on user_id but do match on universal_id. There is an arbitrary tie-break if multiple universal_ids match on one or more user_ids: all matching universal_ids are assigned to the same user_id's global_id
    5. Create a new global_id for every universal_id which cannot be linked to a user_id but has multiple occurrences (n>1)
    6. Propagate values for global_id to all rows with matching universal_id
    7. Propagate existing values for global_id to all rows with a matching session_id, i.e. rows that don't match on user_id or universal_id but do match on session_id
    8. Create a new global_id for every session_id which cannot be linked to either a user_id or a universal_id but has multiple occurrences (n>1)
    9. Propagate values for global_id to all rows with matching session_id
    10. (not needed in your example but might be useful): Create a new global_id for every session_id which does not have multiple occurrences (n=1)

    Hope this helps!

    import uuid
    import pandas as pd
    import numpy as np
    
    
    data = [
        [1, 1, 'apple', 'fiat', 1],
        [2, 1, 'pear', 'bmw', 1],
        [3, 2, 'bananna', 'citroen', 2],
        [4, 3, 'bananna', 'kia', 3],
        [5, 4, 'blueberry', 'peugeot', 4],
        [6, None, 'blueberry', 'peugeot', 4],
        [7, None, 'blueberry', 'yamaha', 4],
        [8, 5, 'plum', 'ford', 5],
        [9, None, 'watermelon', 'ford', 5],
        [10, None, 'raspberry', 'honda', 6],
        [11, None, 'raspberry', 'toyota', 6],
        [12, None, 'avocado', 'mercedes', 7],
        [13, None, 'cherry', 'mercedes', 7],
        [14, None, 'apricot', 'volkswagen', 2],
        [15, 2, 'apricot', 'volkswagen', 2],
        [16, 6, 'blueberry', 'audi', 8],
        [17, None, 'blackberry', 'bmw', 1],
        [18, 7, 'plum', 'porsche', 9]
    ]
    
    
    def generate_uuid(use_uuid1: bool=True, use_uuid4: bool=False) -> str:
        """Helper function creating UUIDs.
        
        Arguments:
            use_uuid1: Whether generated UUID string should feature a UUID1 part.
                Defaults to `True`.
            use_uuid4: Whether generated UUID string should feature a UUID4 part.
                Defaults to `False`.
        
        Returns:
            Universally unique identifier based on UUID1 and/or UUID4.
        """
        uuid_str = ""
    
        if not use_uuid1 and not use_uuid4:
            raise ValueError("Both use_uuid1 and use_uuid4 are set to `False`, cannot create UUID.")
    
        elif use_uuid1 and use_uuid4:
            uuid_str += f"{str(uuid.uuid1())}-{str(uuid.uuid4())}"
        elif use_uuid1:
            uuid_str += str(uuid.uuid1())
        else:
            uuid_str += str(uuid.uuid4())
        
        return uuid_str
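
    # Assumed usage of the helper, for reference:
    #   generate_uuid()                                -> a UUID1 string
    #   generate_uuid(use_uuid1=True, use_uuid4=True)  -> "<uuid1>-<uuid4>"
    #   generate_uuid(use_uuid1=False, use_uuid4=True) -> a UUID4 string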
    
    
    df = pd.DataFrame(data, columns=['id', 'user_id', 'universal_id', 'session_id', 'expected_result'])
    
    # STEP 1: after sorting by user_id, flag the first row of every user_id that
    # occurs more than once and mint a new global_id for it
    df.sort_values(by='user_id', inplace=True)
    df['_same_user_id'] = (
        (df['user_id'] == df['user_id'].shift(-1))
        & (df['user_id'] != df['user_id'].shift(1))
    )
    df['global_id'] = [generate_uuid() if value else np.nan for value in df['_same_user_id'].values]
    
    # STEP 2
    df['global_id'] = df.groupby('user_id')['global_id'].ffill()
    
    # STEP 3: mint a candidate ID for every row that has a user_id, but keep it only
    # where global_id is still empty (i.e. single-occurrence user_ids)
    df['_new_ids'] = [generate_uuid() if not np.isnan(value) else np.nan for value in df['user_id'].values]
    df['global_id'] = df['global_id'].fillna(df['_new_ids'])
    
    # STEP 4: sort by universal_id and forward-fill, so rows without a user_id match
    # inherit the global_id of a matching universal_id (arbitrary tie-break via sort order)
    df.sort_values(by='universal_id', inplace=True)
    df['global_id'] = df.groupby('universal_id')['global_id'].ffill()
    
    # STEP 5: mint a new global_id for every universal_id that is still unidentified
    # but occurs more than once (flag only the first row of each run)
    df['_count_universal_id'] = df['universal_id'].groupby(df['universal_id']).transform('count')
    df['_same_universal_id'] = (
        (df['universal_id'] == df['universal_id'].shift(-1))
        & (df['universal_id'] != df['universal_id'].shift(1))
    )
    df['_new_id_for_universal_id'] = (
        df['_count_universal_id'].gt(1)
        & (df['global_id'].isnull())
        & df['_same_universal_id']
    )
    df['_new_ids'] = [generate_uuid() if value else np.nan for value in df['_new_id_for_universal_id'].values]
    df['global_id'] = df['global_id'].fillna(df['_new_ids'])
    
    # STEP 6
    df['global_id'] = df.groupby('universal_id')['global_id'].ffill()
    
    
    # STEP 7
    df.sort_values(by='session_id', inplace=True)
    df['global_id'] = df.groupby('session_id')['global_id'].ffill()
    
    # STEP 8: mint a new global_id for every session_id that is still unidentified
    # but occurs more than once (flag only the first row of each run)
    df['_count_session_id'] = df['session_id'].groupby(df['session_id']).transform('count')
    df['_same_session_id'] = (
        (df['session_id'] == df['session_id'].shift(-1))
        & (df['session_id'] != df['session_id'].shift(1))
    )
    df['_new_id_for_session_id'] = (
        df['_count_session_id'].gt(1)
        & (df['global_id'].isnull())
        & df['_same_session_id']
    )
    df['_new_ids'] = [generate_uuid() if value else np.nan for value in df['_new_id_for_session_id'].values]
    df['global_id'] = df['global_id'].fillna(df['_new_ids'])
    
    # STEP 9
    df['global_id'] = df.groupby('session_id')['global_id'].ffill()
    
    # STEP 10: any remaining single-occurrence session_id gets its own new global_id
    df['_new_ids'] = [generate_uuid() if value == 1 else np.nan for value in df['_count_session_id'].values]
    df['global_id'] = df['global_id'].fillna(df['_new_ids'])
    
    # DROP INTERNAL COLUMNS
    cols_to_drop = [col for col in df.columns if col.startswith("_")]
    df.drop(columns=cols_to_drop, inplace=True)
    

    Results (since we're now using UUID1, the IDs look very similar, but they are not the same).

    |   id |   user_id | universal_id   | session_id   |   expected_result | global_id                            |
    |-----:|----------:|:---------------|:-------------|------------------:|:-------------------------------------|
    |    2 |         1 | pear           | bmw          |                 1 | ee52b80a-f0da-11ed-8f35-0242ac1c000c |
    |   17 |       nan | blackberry     | bmw          |                 1 | ee52b80a-f0da-11ed-8f35-0242ac1c000c |
    |    1 |         1 | apple          | fiat         |                 1 | ee52b80a-f0da-11ed-8f35-0242ac1c000c |
    |    3 |         2 | bananna        | citroen      |                 2 | ee52ba1c-f0da-11ed-8f35-0242ac1c000c |
    |   14 |       nan | apricot        | volkswagen   |                 2 | ee52ba1c-f0da-11ed-8f35-0242ac1c000c |
    |   15 |         2 | apricot        | volkswagen   |                 2 | ee52ba1c-f0da-11ed-8f35-0242ac1c000c |
    |    4 |         3 | bananna        | kia          |                 3 | ee530d00-f0da-11ed-8f35-0242ac1c000c |
    |    5 |         4 | blueberry      | peugeot      |                 4 | ee530e04-f0da-11ed-8f35-0242ac1c000c |
    |    6 |       nan | blueberry      | peugeot      |                 4 | ee530e04-f0da-11ed-8f35-0242ac1c000c |
    |    7 |       nan | blueberry      | yamaha       |                 4 | ee530e04-f0da-11ed-8f35-0242ac1c000c |
    |    9 |       nan | watermelon     | ford         |                 5 | ee530f08-f0da-11ed-8f35-0242ac1c000c |
    |    8 |         5 | plum           | ford         |                 5 | ee530f08-f0da-11ed-8f35-0242ac1c000c |
    |   16 |         6 | blueberry      | audi         |                 8 | ee53100c-f0da-11ed-8f35-0242ac1c000c |
    |   18 |         7 | plum           | porsche      |                 9 | ee531106-f0da-11ed-8f35-0242ac1c000c |
    |   10 |       nan | raspberry      | honda        |                 6 | ee54a002-f0da-11ed-8f35-0242ac1c000c |
    |   11 |       nan | raspberry      | toyota       |                 6 | ee54a002-f0da-11ed-8f35-0242ac1c000c |
    |   13 |       nan | cherry         | mercedes     |                 7 | ee567260-f0da-11ed-8f35-0242ac1c000c |
    |   12 |       nan | avocado        | mercedes     |                 7 | ee567260-f0da-11ed-8f35-0242ac1c000c |
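
    Side note on scale: since PySpark was mentioned for millions of rows, the sort-and-forward-fill pattern behind most of the steps above has a direct Spark equivalent via window functions. A minimal sketch, assuming a Spark DataFrame named sdf (a hypothetical name) with the same columns and id as the ordering column:

    from pyspark.sql import functions as F, Window

    # sdf: a Spark DataFrame with the same columns as df (assumed to already exist)
    # Spark analogue of df.groupby('user_id')['global_id'].ffill():
    # forward-fill global_id within each user_id partition, ordered by id
    w = (
        Window.partitionBy('user_id')
        .orderBy('id')
        .rowsBetween(Window.unboundedPreceding, Window.currentRow)
    )
    sdf = sdf.withColumn('global_id', F.last('global_id', ignorenulls=True).over(w))

    # Where a brand-new ID is needed (the "create a new global_id" steps), Spark's
    # built-in uuid() expression can mint one per row where global_id is still null:
    # sdf = sdf.withColumn('global_id', F.coalesce('global_id', F.expr('uuid()')))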