Search code examples
rdplyrsparse-matrixreshapetidyr

How to reshape a dataframe and paste multiple values in a single cell of a sparse matrix in R?


I have a dataframe like mentioned below. I want to create a sparse matrix whose row names and column names are unique station names and in the cells of sparse matrix I want all the train numbers for that particular station name. Here is the link of whole data

Train.No. train.Name       isl.no. station.code     Station.Name
'00851'   BNC SUVIDHA SPL   1      BBS              BHUBANESWAR    
'00851'   BNC SUVIDHA SPL   2      BAM              BRAHMAPUR      
'00851'   BNC SUVIDHA SPL   3      VSKP             VISAKHAPATNAM  
'00851'   BNC SUVIDHA SPL   4      BZA              VIJAYAWADA JN  
'00851'   BNC SUVIDHA SPL   5      RU               RENIGUNTA JN   
'00851'   BNC SUVIDHA SPL   6      JTJ              JOLARPETTAI    
'00851'   BNC SUVIDHA SPL   7      BNC              BANGALORE CANT 
'00852'   BNC BBS SUVIDHA   1      BNC              BANGALORE CANT 
'00852'   BNC BBS SUVIDHA   2      JTJ              JOLARPETTAI    

I am getting the desired output using the below code but this process is too much time taking as this is not a sparse matrix with dimension 4337*4337.

r1 <- rail

mat_n <- matrix(data = NA, nrow = length(unique(r1$Station.Name)), 
ncol = length(unique(r1$Station.Name)))
rownames(mat_n) <- unique(r1$Station.Name)
colnames(mat_n) <- unique(r1$Station.Name)

a1 <- unique(r1$Train.No.)

for(k in 1:length(a1)){
fd1 <- grep(a1[k], r1$Train.No.)
for(i in 1:nrow(mat_n)){
sta1 <- rownames(mat_n)[i]
for(j in 1:ncol(mat_n)){
  if(i != j){
    sta2 <- colnames(mat_n)[j]
    if(length(grep(sta1, r1$Station.Name[fd1[1]]))>0 & length(grep(sta2, r1$Station.Name[fd1[1]:fd1[length(fd1)]]))>0){
      mat_n[i,j] <-paste(mat_n[i,j], a1[k])}

  }
}
}
}

What is the alternate of the same using packages like reshape2, dplyr, tidyr etc? I searched for the same but got nothing which give me the desired output. This is the form of desired output I want.


Solution

  • Starting with your initial dataframe r1, this dplyr/tidyr solution may get you the results you need

    library(dplyr)
    library(tidyr)
    
    r1 <- r1 %>% 
      arrange(Train.No., isl.no.) %>%
      group_by(Train.No.) %>%
      mutate(Start.Station = first(Station.Name)) %>%
      ungroup() %>% 
      mutate(rownum = row_number()) %>%
      spread(Station.Name, Train.No.) %>%
      select(-train.Name, -isl.no., -station.code, -rownum) %>%
      group_by(Start.Station) %>%
      summarise_each(funs(paste(na.omit(.), collapse = " ")))
    
    
    row.names(r1) <- r1$Start.Station
    r1$Start.Station <- NULL