Search code examples
rdataframejoinsqldf

Replace missing data with values from matching rows in another dataframe


I want to replace values of a variable.

This is my dB :

head(mydb)

ID   V1     V2     V3     V4 
1    value  value  0    0
2    value  value  0    0  
3    value  value  0    0
4    value  value  0    0  
5    value  value  0    0
6    value  value  0    0  

V3 and V4 have 0 for all observations.

Then I created several dBs like below:

head(newdb)
ID   V3     V4 
2    5      4  
4    8      5  
6    9      6  

I want to obtain something like this :

ID   V1     V2     V3     V4 
1    value  value  0    0
2    value  value  5    4  
3    value  value  0    0
4    value  value  8    5  
5    value  value  0    0
6    value  value  9    6

I tried with to do it like this :

mydf <- sqldf('SELECT mydf.*, newdb.v3, newdb.v4              
               FROM mydf
               LEFT JOIN newdb 
               ON  mydf.ID = newdb.id')

The code I created works well and do its job; but the problem is that I'm inside a for loop, and for each of my newDB the sql code generates others V3 and V4 and attach it close the previous one creating something like this :

ID   V1     V2     V3   V4   V3   V4   V3  V4
1    value  value  0    0    1    5    0   0
2    value  value  5    4    0    0    0   0
3    value  value  0    0    0    0    7   8
4    value  value  8    5    0    0    0   0
5    value  value  0    0    2    2    0   0 
6    value  value  9    6    0    0    0   0

I've added another V3 and V4 columns for each iterations of my loop (in this case 3 iterations).

How can I avoid this problem?


Solution

  • You can simply match the IDs in two dataframes and replace them in the original one:

    mydf[,c("V3","V4")] <- newdf[match(mydf$ID, newdf$ID),c("V3","V4")]
    
    mydf
    #   ID  V1    V2   V3 V4
    # 1  1 value value NA NA
    # 2  2 value value  5  4
    # 3  3 value value NA NA
    # 4  4 value value  8  5
    # 5  5 value value NA NA
    # 6  6 value value  9  6
    

    Later you can replace NAs with 0s.

    Update:

    Instead of doing a for-loop concatenate all the newdfs together and then run the code on that; look at the pseudo-code below:

    newdf_concat <- rbind(newdf1, newdf2)
    
    mydf[,c("V3","V4")] <- newdf_concat[match(mydf$ID, newdf_concat$ID),c("V3","V4")]