I have two datasets with overlapping but non-identical columns of strings for street address and apartment number, I would like to create the same unique identifier in the two datasets and then merge them with that identifier.
My first instinct is to do this for both datasets then merge:
df <- unidue(data_set1)
df$unit_id <- 1:nrow(df)
df_final <- merge(data_set1,df)
But because the two datasets are of nonequivalent sizes, I'm at a loss for what to do. I would imagine that I have to encode the letters in the street address and encode the apartment number to create an id but I don't know how to do so in R. For the sake of simplicity, I assume that all my addresses are numbers
Dataset 1
address unit
1111
1111
2111 F
1114 2G
1311
Dataset 2
address unit
1311
2111 F
1111 6
1114 2G
I would like to create a unique ID determined not by the sequence of the addresses but their contents.
Dataset 1
address unit id
1111 3333
1111 3333
2111 F 3334
1114 2G 3335
1311 3336
Dataset 2
address unit id
1311 3336
2111 F 3334
1111 6 3337
1114 2G 3335
And then merge so:
address unit id
1111 3333
1111 3333
1111 6 3337
2111 F 3334
1114 2G 3335
1311 3336
We create common values across both the dataset by getting the unique values from 'address' column in both dataset, use that as levels
for converting the 'address' to factor
and then coerce it to integer
lvls <- unique(c(df1$address, df2$address))
df1$id <- as.integer(factor(df1$address, levels = lvls)) + 3332
df2$id <- as.integer(factor(df2$address, levels = lvls)) + 3332
After that merge
the two datasets by the 'id' column
merge(df1, df2, all = TRUE, by = "id")