I'm working with a dataframe possessing over 7000 observations where each respondent has been allocated a numeric value that identifies their geographical location.
#DF 1
USER_ID Col2 ... NumIdentifier
45 4 101
12 9 98
97 19 7
11 3 104
54 1 109
2 23 110
... ... ...
Now I have been provided with additional information (Var1, Var2) that needs to be allocated to only some of the respondents based on this numeric geographic identifier.
#DF 2
NumIdentifer Var1 Var2
101 13 20
104 16 87
109 34 21
... ... ...
'DF 2' contains one row per numeric geographic identifier and contains a smaller subset of geographic identifiers than are present in 'DF 1'. There are around 30 rows in 'DF 2'.
As a first step I have constructed a new dataframe from 'DF 1' that only includes the respondents with the numeric identifier present in 'DF 2'.
#DF 3
USER_ID Col2 ... NumIdentifier
45 4 101
11 3 104
54 1 109
... ... ...
The ideal output I am aiming for would look like this. If the numerical identifier in 'DF 2' equals the numerical identifier in 'DF 3' then Var1 and Var2 for each corresponding row would be inserted.
#DF 3
USER_ID Col2 ... NumIdentifier Var1 Var2
45 4 101 13 20
11 3 104 16 87
54 1 109 34 21
... ... ... ... ...
Having tried unsuccessfully to transfer some of the techniques from these two posts:
I was wondering if there were some other resources or original insights that might be of help. The methods in these two posts seem to only be effective when making matches and conditional replacements between dataframes with an equal number of rows.
There are 3 approaches to this.
Use merge function (native)
Use merge in dplyr
Use sqldf library
My preference would be native/dplyr as sqldf actually converts your dataframes to a SQLite db and hence requires additional memory.
> df1 <- data.frame("NumIdentifier" = c(101,98,7,104,109,11), "USER_ID" = c(45,12,97,11,54,2), "Col2" = c(4,9,19,3,1,23))
> df1
NumIdentifier USER_ID Col2
1 101 45 4
2 98 12 9
3 7 97 19
4 104 11 3
5 109 54 1
6 11 2 23
> df2 <- data.frame("NumIdentifier" = c(101,104,109), "Var1" = c(13,16,34), "Var2" = c(20,87,21))
> df2
NumIdentifier Var1 Var2
1 101 13 20
2 104 16 87
3 109 34 21
Merge() function in dplyr
df3 <- merge(x = df1, y = df2, by = "NumIdentifier", all.y = TRUE)
> df3
NumIdentifier USER_ID Col2 Var1 Var2
1 101 45 4 13 20
2 104 11 3 16 87
3 109 54 1 34 21
sqldf
> library(sqldf)
> df4 <- sqldf("SELECT * FROM df2 LEFT JOIN df1 USING(NumIdentifier)")
> df4
NumIdentifier Var1 Var2 USER_ID Col2
1 101 13 20 45 4
2 104 16 87 11 3
3 109 34 21 54 1