Search code examples
pythonpandasdataframeuniqueidentifier

Assign unique identifier for dataframe rows based on dataframe with preassigned unique identifier


I have dataframe with unique identifier assigned based on three columns i.e., [col2,col3,col3]

Dataframe1:

col1      col2     col3     col4      col5         unique_id
1         abc       bcv      zxc      www.com        8
2         bcd       qwe      rty      www.@com       12
3         klp       oiu      ytr      www.io         15
4         zxc       qwe      rty      www.com        6

After data preprocessing, will import Dataframe_2 with same column values as shown above but without unique_id. Dataframe_2 rows must be assigned with unique identifier based on col2,col3,col4 and by referring to the Dataframe1.

If Dataframe_2 has new row which is not present in Dataframe1, then assign new identifier.

Dataframe_2:

col1      col2     col3     col4      col5         
1         bcd       qwe      rty      www.@com              
2         zxc       qwe      rty      www.com
3         abc       bcv      zxc      www.com 
4         kph       hir      mat      www.com            

Expected Dataframe_2:

col1      col2     col3     col4      col5         unique_id        
1         bcd       qwe      rty      www.@com        12     
2         zxc       qwe      rty      www.com         6
3         abc       bcv      zxc      www.com         8 
4         kph       hir      mat      www.com         35

Since Row4 is not present in Dataframe1, a new unique identifier is assigned.


Solution

  • # assign the old unique_id
    df2n = df2.join(df1.set_index(['col2', 'col3', 'col4', 'col5'])[['unique_id']],
             on=['col2', 'col3', 'col4', 'col5'], how='left')
    
    # assign new unique_id with max df1.unique_id + 1
    id_max = df1.unique_id.max() + 1
    null_num = df2n['unique_id'].isnull().sum()
    
    cond = df2n['unique_id'].isnull()
    df2n.loc[cond,'unique_id'] = range(id_max, id_max + null_num)
    df2n['unique_id'] = df2n['unique_id'].astype(int)
    
    print(df2n)
    
          col1 col2 col3 col4      col5  unique_id
        0     1  bcd  qwe  rty  www.@com         12
        1     2  zxc  qwe  rty   www.com          6
        2     3  abc  bcv  zxc   www.com          8
        3     4  kph  hir  mat   www.com         16