Search code examples
runiqueidentifierunique-id

r - encode two variables to create a unique ID across two datasets


I have two datasets with overlapping but non-identical columns of strings for street address and apartment number, I would like to create the same unique identifier in the two datasets and then merge them with that identifier.

My first instinct is to do this for both datasets then merge:

df <- unidue(data_set1)
df$unit_id <- 1:nrow(df)
df_final <- merge(data_set1,df)

But because the two datasets are of nonequivalent sizes, I'm at a loss for what to do. I would imagine that I have to encode the letters in the street address and encode the apartment number to create an id but I don't know how to do so in R. For the sake of simplicity, I assume that all my addresses are numbers

Dataset 1

address unit 
1111        
1111    
2111    F
1114    2G
1311    

Dataset 2

address unit
1311   
2111    F
1111    6
1114    2G 

I would like to create a unique ID determined not by the sequence of the addresses but their contents.

Dataset 1

address unit id
1111         3333
1111         3333
2111    F    3334
1114    2G   3335
1311         3336

Dataset 2

address unit id
1311         3336
2111    F    3334
1111    6    3337
1114    2G   3335

And then merge so:

address unit id
1111         3333
1111         3333
1111    6    3337
2111    F    3334
1114    2G   3335
1311         3336

Solution

  • We create common values across both the dataset by getting the unique values from 'address' column in both dataset, use that as levels for converting the 'address' to factor and then coerce it to integer

    lvls <- unique(c(df1$address, df2$address))
    df1$id <- as.integer(factor(df1$address, levels = lvls)) + 3332
    df2$id <- as.integer(factor(df2$address, levels = lvls)) + 3332
    

    After that merge the two datasets by the 'id' column

    merge(df1, df2, all = TRUE, by = "id")