Creating origin-destination matrices with R


My data frame consists of individuals and the city they live in at a point in time. I would like to generate one origin-destination matrix for each year, recording the number of moves from one city to another. I would like to know:

  1. How can I generate the origin-destination tables for each year in my dataset automatically?
  2. How can I generate all tables in the same 5x5 format, 5 being the number of cities in my example?
  3. Is there more efficient code than what I propose below? I intend to run it on a very large dataset.

Consider the following example:

#An example dataframe
id=sample(1:5,50,TRUE)
year=sample(2005:2010,50,TRUE)
city=sample(paste(rep("City",5),1:5,sep=""),50,TRUE)
df=as.data.frame(cbind(id,year,city),stringsAsFactors=FALSE)
df$year=as.numeric(df$year)
df=df[order(df$id,df$year),]
rm(id,year,city)

My best try

#Creating variables
for(i in 1:length(df$id)){
  df$origin[i]=df$city[i]
  df$destination[i]=df$city[i+1]
  df$move[i]=ifelse(df$origin[i]!=df$destination[i] & df$id[i]==df$id[i+1],1,0) #Checking whether a move has taken place and whether it is the same person
  df$year_move[i]=ceiling((df$year[i]+df$year[i+1])/2) #I assume the person moved exactly midway between the two dates at which their location was recorded
}
df=df[df$move!=0,c("origin","destination","year_move")]    

Creating an origin-destination table for 2007

yr07=df[df$year_move==2007,]
table(yr07$origin,yr07$destination)

Result

        City1 City2 City3 City5
  City1     0     0     1     2
  City2     2     0     0     0
  City5     1     1     0     0

Solution

  • You can split your data frame by id, perform the necessary computations on each id-specific data frame to grab all the moves made by that person, and then recombine the results:

    spl <- split(df, df$id)
    move.spl <- lapply(spl, function(x) {
      ret <- data.frame(from=head(x$city, -1), to=tail(x$city, -1),
                        year=ceiling((head(x$year, -1)+tail(x$year, -1))/2),
                        stringsAsFactors=FALSE)
      ret[ret$from != ret$to,]
    })
    (moves <- do.call(rbind, move.spl))
    #       from    to year
    # 1.1  City4 City2 2007
    # 1.2  City2 City1 2008
    # 1.3  City1 City5 2009
    # 1.4  City5 City4 2009
    # 1.5  City4 City2 2009
    # ...
    

    Because this code uses vectorized computations for each id, it should be a good deal quicker than looping through each row of your data frame as you did in the provided code.
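
    If the per-id split itself becomes a bottleneck on a very large dataset, the same idea can be applied to the whole data frame at once by comparing each row with the next one. The following is only a minimal sketch, assuming the data frame is already sorted by id and then by year (as in the question); the toy data and the names cur, nxt, is_move and moves_vec are made up for illustration.

    ```r
    # Hypothetical toy data, already sorted by id and then year
    df <- data.frame(
      id   = c(1, 1, 1, 2, 2),
      year = c(2005, 2006, 2008, 2005, 2007),
      city = c("City1", "City2", "City2", "City3", "City1"),
      stringsAsFactors = FALSE
    )

    n   <- nrow(df)
    cur <- seq_len(max(n - 1, 0))  # rows 1 .. n-1 (the "from" observations)
    nxt <- cur + 1                 # rows 2 .. n   (the "to" observations)

    # A move is a pair of consecutive rows with the same id but different cities
    is_move <- df$id[cur] == df$id[nxt] & df$city[cur] != df$city[nxt]

    moves_vec <- data.frame(
      from = df$city[cur][is_move],
      to   = df$city[nxt][is_move],
      year = ceiling((df$year[cur] + df$year[nxt]) / 2)[is_move],
      stringsAsFactors = FALSE
    )
    moves_vec
    ```

    Pairs of rows that straddle two different ids are discarded by the id comparison, so no explicit grouping step is needed.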

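    Now you can get the year-specific 5x5 move matrices using split and table. Converting from and to into factors with a shared set of levels keeps every table in the same layout, even when a city has no moves in a given year:

    cities <- sort(unique(c(moves$from, moves$to)))
    moves$from <- factor(moves$from, levels=cities)
    moves$to <- factor(moves$to, levels=cities)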
    lapply(split(moves, moves$year), function(x) table(x$from, x$to))
    # $`2005`
    #        
    #         City1 City2 City3 City4 City5
    #   City1     0     0     0     0     1
    #   City2     0     0     0     0     0
    #   City3     0     0     0     0     0
    #   City4     0     0     0     0     0
    #   City5     0     0     1     0     0
    # 
    # $`2006`
    #        
    #         City1 City2 City3 City4 City5
    #   City1     0     0     0     1     0
    #   City2     0     0     0     0     0
    #   City3     1     0     0     1     0
    #   City4     0     0     0     0     0
    #   City5     2     0     0     0     0
    # ...
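
    As an alternative to a list of tables, the three variables can be cross-tabulated in one call, which gives a single from x to x year array. This is a sketch rather than a drop-in replacement: the toy moves data is invented, and it assumes the full list of cities (all_cities) is known up front, so that a city with no recorded moves at all still gets its own row and column.

    ```r
    # Hypothetical moves data; City4 never appears as an origin or destination
    moves <- data.frame(
      from = c("City1", "City2", "City5", "City1"),
      to   = c("City3", "City1", "City2", "City5"),
      year = c(2006, 2006, 2007, 2007),
      stringsAsFactors = FALSE
    )

    # Assumption: the complete set of cities is known in advance
    all_cities <- paste0("City", 1:5)

    moves$from <- factor(moves$from, levels = all_cities)
    moves$to   <- factor(moves$to,   levels = all_cities)

    # One 3-d contingency table: origin x destination x year
    od_array <- table(from = moves$from, to = moves$to, year = moves$year)

    # The 5x5 origin-destination matrix for a single year
    od_array[, , "2007"]
    ```

    Individual counts can then be looked up by name, for example od_array["City5", "City2", "2007"].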