Search code examples
rmergecbind

Merge, cbind: How to merge better?


I want to merge multiple vectors to a data frame. There are two variables, city and id that are going to be used for matching vectors to data frame.

df <- data.frame(array(NA, dim =c(10*50, 2)))
names(df)<-c("city", "id")
df[,1]<-rep(1:50, each=10)
df[,2]<-rep(1:10, 50)

I created a data frame like this. To this data frame, I want to merge 50 vectors that each corresponds to 50 cities. The problem is that each city only has 6 obs. Each city will have 4 NAs.

To give you an example, city 1 data looks like this:

seed(1234) 
cbind(city=1,id=sample(1:10,6),obs=rnorm(6))

I have 50 cities data and I want to merge them to one column in df. I have tried the following code:

for(i in 1:50){
citydata<-cbind(city=i,id=sample(1:10,6),obs=rnorm(6))   # each city data
df<-merge(df,citydata, by=c("city", "id"), all=TRUE)}    # merge to df

But if I run this, the loop will show warnings like this:

In merge.data.frame(df, citydata, by = c("city", "id"),  ... :
  column names ‘obs.x’, ‘obs.y’ are duplicated in the result

and it will create 50 columns, instead of one long column.

How can I merge cbind(city=i,id=sample(1:10,6),obs=rnorm(6)) to df in a one nice and long column? It seems both cbind and merge are not ways to go.


In case there are 50 citydata (each has 6 rows), I can rbind them as one long data and use data.table approach or expand.gird+merge approach as Philip and Jaap suggested.

I wonder if I can merge each citydata through a loop one by one, instead of rbind them and merge it to df.


Solution

  • data.table is good for this:

    library(data.table)
    df <- data.table(df)
    > df
         city id
      1:    1  1
      2:    1  2
      3:    1  3
      4:    1  4
      5:    1  5
     ---
    496:   50  6
    497:   50  7
    498:   50  8
    499:   50  9
    500:   50 10
    

    I'm using CJ instead of your for loop to make some dummy data. CJ cross-joins each column against each value of each other column, so it makes a two-column table with each possible pair of values of city and id. The [,obs:=rnorm(.N)] command adds a third column that draws random values (without recycling them as it would if it were inside the CJ)--.N means "# rows of this table" in this context.

    citydata <- CJ(city=1:50,id=1:6)[,obs:=rnorm(.N)]
    > citydata
         city id         obs
      1:    1  1  0.19168335
      2:    1  2  0.35753229
      3:    1  3  1.35707865
      4:    1  4  1.91871907
      5:    1  5 -0.56961647
     ---
    296:   50  2  0.30592659
    297:   50  3 -0.44989646
    298:   50  4  0.05359738
    299:   50  5 -0.57494269
    300:   50  6  0.09565473
    
    setkey(df,city,id)
    setkey(citydata,city,id)
    

    As these two tables have the same key columns the following looks up rows of df by the key columns in citydata, then defines obs in df by looking up the value in citydata. Therefore the resulting object is the original df but with obs defined wherever it was defined in citydata:

    df[citydata,obs:=i.obs]
    > df
         city id         obs
      1:    1  1  0.19168335
      2:    1  2  0.35753229
      3:    1  3  1.35707865
      4:    1  4  1.91871907
      5:    1  5 -0.56961647
     ---
    496:   50  6  0.09565473
    497:   50  7          NA
    498:   50  8          NA
    499:   50  9          NA
    500:   50 10          NA