Search code examples
rdata-cleaning

R - Insert Rows Based on Matching Idenitifiers Between Dataframes of Different Lengths


I'm working with a dataframe possessing over 7000 observations where each respondent has been allocated a numeric value that identifies their geographical location.

#DF 1

USER_ID   Col2   ...   NumIdentifier
  45        4                101
  12        9                98
  97        19               7
  11        3                104
  54        1                109
  2         23               110
  ...       ...              ...

Now I have been provided with additional information (Var1, Var2) that needs to be allocated to only some of the respondents based on this numeric geographic identifier.

#DF 2

NumIdentifer   Var1   Var2
   101          13     20
   104          16     87
   109          34     21
   ...         ...     ...

'DF 2' contains one row per numeric geographic identifier and contains a smaller subset of geographic identifiers than are present in 'DF 1'. There are around 30 rows in 'DF 2'.

As a first step I have constructed a new dataframe from 'DF 1' that only includes the respondents with the numeric identifier present in 'DF 2'.

#DF 3
USER_ID   Col2   ...   NumIdentifier
  45        4                101
  11        3                104
  54        1                109
  ...       ...              ...

The ideal output I am aiming for would look like this. If the numerical identifier in 'DF 2' equals the numerical identifier in 'DF 3' then Var1 and Var2 for each corresponding row would be inserted.

#DF 3
USER_ID   Col2   ...   NumIdentifier   Var1   Var2
  45        4                101        13     20
  11        3                104        16     87
  54        1                109        34     21
  ...       ...              ...        ...   ...

Having tried unsuccessfully to transfer some of the techniques from these two posts:

  • "r - matching of two dataframes and insertion of rows from one in another on condition"
  • "R - replace values in dataframe based on two matching conditions"

I was wondering if there were some other resources or original insights that might be of help. The methods in these two posts seem to only be effective when making matches and conditional replacements between dataframes with an equal number of rows.


Solution

  • There are 3 approaches to this.

    1. Use merge function (native)

    2. Use merge in dplyr

    3. Use sqldf library

    My preference would be native/dplyr as sqldf actually converts your dataframes to a SQLite db and hence requires additional memory.

    > df1 <- data.frame("NumIdentifier" = c(101,98,7,104,109,11), "USER_ID" = c(45,12,97,11,54,2), "Col2" = c(4,9,19,3,1,23))
    > df1
      NumIdentifier USER_ID Col2
    1           101      45    4
    2            98      12    9
    3             7      97   19
    4           104      11    3
    5           109      54    1
    6            11       2   23
    
    > df2 <- data.frame("NumIdentifier" = c(101,104,109), "Var1" = c(13,16,34), "Var2" = c(20,87,21))
    
    > df2
      NumIdentifier Var1 Var2
    1           101   13   20
    2           104   16   87
    3           109   34   21
    
    

    Merge() function in dplyr

    df3 <- merge(x = df1, y = df2, by = "NumIdentifier", all.y = TRUE)
    > df3
      NumIdentifier USER_ID Col2 Var1 Var2
    1           101      45    4   13   20
    2           104      11    3   16   87
    3           109      54    1   34   21
    
    

    sqldf

    > library(sqldf)
    > df4 <- sqldf("SELECT * FROM df2 LEFT JOIN df1 USING(NumIdentifier)")
    > df4
      NumIdentifier Var1 Var2 USER_ID Col2
    1           101   13   20      45    4
    2           104   16   87      11    3
    3           109   34   21      54    1