I want to merge multiple vectors to a data frame. There are two variables, city
and id
that are going to be used for matching vectors to data frame.
df <- data.frame(array(NA, dim =c(10*50, 2)))
names(df)<-c("city", "id")
df[,1]<-rep(1:50, each=10)
df[,2]<-rep(1:10, 50)
I created a data frame like this. To this data frame, I want to merge 50 vectors that each corresponds to 50 cities. The problem is that each city only has 6 obs. Each city will have 4 NAs.
To give you an example, city 1 data looks like this:
seed(1234)
cbind(city=1,id=sample(1:10,6),obs=rnorm(6))
I have 50 cities data and I want to merge them to one column in df. I have tried the following code:
for(i in 1:50){
citydata<-cbind(city=i,id=sample(1:10,6),obs=rnorm(6)) # each city data
df<-merge(df,citydata, by=c("city", "id"), all=TRUE)} # merge to df
But if I run this, the loop will show warnings like this:
In merge.data.frame(df, citydata, by = c("city", "id"), ... :
column names ‘obs.x’, ‘obs.y’ are duplicated in the result
and it will create 50 columns, instead of one long column.
How can I merge cbind(city=i,id=sample(1:10,6),obs=rnorm(6))
to df
in a one nice and long column? It seems both cbind
and merge
are not ways to go.
In case there are 50 citydata
(each has 6 rows), I can rbind
them as one long data and use data.table
approach or expand.gird
+merge
approach as Philip and Jaap suggested.
I wonder if I can merge each citydata through a loop one by one, instead of rbind
them and merge it to df
.
data.table
is good for this:
library(data.table)
df <- data.table(df)
> df
city id
1: 1 1
2: 1 2
3: 1 3
4: 1 4
5: 1 5
---
496: 50 6
497: 50 7
498: 50 8
499: 50 9
500: 50 10
I'm using CJ
instead of your for loop to make some dummy data. CJ
cross-joins each column against each value of each other column, so it makes a two-column table with each possible pair of values of city
and id
. The [,obs:=rnorm(.N)]
command adds a third column that draws random values (without recycling them as it would if it were inside the CJ
)--.N
means "# rows of this table" in this context.
citydata <- CJ(city=1:50,id=1:6)[,obs:=rnorm(.N)]
> citydata
city id obs
1: 1 1 0.19168335
2: 1 2 0.35753229
3: 1 3 1.35707865
4: 1 4 1.91871907
5: 1 5 -0.56961647
---
296: 50 2 0.30592659
297: 50 3 -0.44989646
298: 50 4 0.05359738
299: 50 5 -0.57494269
300: 50 6 0.09565473
setkey(df,city,id)
setkey(citydata,city,id)
As these two tables have the same key columns the following looks up rows of df
by the key columns in citydata
, then defines obs
in df
by looking up the value in citydata
. Therefore the resulting object is the original df
but with obs
defined wherever it was defined in citydata
:
df[citydata,obs:=i.obs]
> df
city id obs
1: 1 1 0.19168335
2: 1 2 0.35753229
3: 1 3 1.35707865
4: 1 4 1.91871907
5: 1 5 -0.56961647
---
496: 50 6 0.09565473
497: 50 7 NA
498: 50 8 NA
499: 50 9 NA
500: 50 10 NA