Search code examples
rdata-analysiscrosstab

Cross-tab in R with data.tables


Sorry if this question has been asked, I played with my toy data to learn to manipulate data.tables. My goal was from this data:

toy_data = data.table(from=c("A","A","A","C","E","E","A","A","A","C","E","E"),
                        to=c("B","C","A","D","F","E","E","A","A","A","C",NA))

to arrive at this result:

final_matrix
     L    A    B    C    D    E    F
1:   A    3    1    2 <NA>    1 <NA>
2:   B    1    0 <NA> <NA> <NA> <NA>
3:   C    2 <NA>    0    1    1 <NA>
4:   D <NA> <NA>    1    0 <NA> <NA>
5:   E    1 <NA>    1 <NA>    1    1
6:   F <NA> <NA> <NA> <NA>    1    0
7: tot    7    1    4    1    4    1

(eventually also with zeros instead of NAs, but got bored). I suppose in STATA this would be an easy cross-tab, I have built a function then looped over the unique values in the cols (sigh :/) merged the tables and then added a final line with the totals. Now although I've learned a lot, I wonder what would the clean R way to obtain such cross-tabs be? since the following doesn't work:

table(toy_data$from,toy_data$to)
   
    A B C D E F
  A 3 1 1 0 1 0
  C 1 0 0 1 0 0
  E 0 0 1 0 1 1

Thanks. My function if you have general improvements or best practices I am super happy:

create_edge_cols<- function(dt,column){
  #this function takes a df and a column, 
  #computes the number of edges among this column and all the other in dt
  #returns a column (list) with the cross-tabulation of columns
  tot_edges_i = dim(dt[from==column|to==column][,.(to=na.omit(to))])[1] # E better! without NAs
  print(tot_edges_i)
  # now tabulate links of column
  tab = data.table(table(unlist(dt[(from==column&to!=column)|
                                           (from!=column&to==column)])))
  setnames(tab, "V1", "L")
  setnames(tab, "N", column)
  setorder(tab,"L")
  tab[L==column,column] = length(dt[to==column & to == from,from])
  #tab[,`:=`(L=L,column=column/as.numeric(tot_edges_i))]
  return(tab)
}

#this should be the first column of our table
first_column = data.table("L"=unique(toy_data[,c(to[!is.na(to)],from)]))

#loop through the values of the columns and merge to a unique df
for (col in sort(unique(toy_data[!is.na(to),c(to,from)]))){
  info_column = copy(create_edge_cols(toy_data,col))
  first_column = merge.data.table(first_column,info_column,all.x = TRUE,all.y = TRUE)
}

## function to set first row as name
header.true <- function(df) {
  names(df) <- as.character(unlist(df[1,]))
  df[-1,]
}
# this should be the last row of our matrix:
last_row = transpose(data.table(table(unlist(toy_data[!is.na(toy_data$to),c(from,to[to!=from])]))))
last_row = cbind(data.table(matrix(c("L","tot"), ncol=1)),last_row)
last_row = header.true(last_row)
last_row

# let's concatenate
final_matrix = rbind(first_column,last_row)
final_matrix

EDIT: solution suggested by previous answer now deleted:

library(igraph)
g <- graph_from_data_frame(na.omit(toy_data), directed = F)
am <- as_adjacency_matrix(g, type = "both")
addmargins(as.matrix(am[order(rownames(am)), order(colnames(am))]), 1)

Solution

  • Here is a way. What is missing in the question's table statement are factor levels, table is only processing what is in the data. Coerce the columns to factors with the same levels and assign NA to counts equal to zero.

    There is also a print issue, see the final two instructions. The default for S# class "table" method print is not to print NA's. This can be changed manually.

    library(data.table)
    
    toy_data = data.table(from=c("A","A","A","C","E","E","A","A","A","C","E","E"),
                          to=c("B","C","A","D","F","E","E","A","A","A","C",NA))
    
    levels <- sort(unique(unlist(toy_data)))
    levels <- levels[!is.na(levels)]
    toy_data[, c("from", "to") := lapply(.SD, factor, levels = levels)]
    tbl <- table(toy_data)
    is.na(tbl) <- tbl == 0
    tbl
    #>     to
    #> from  A  B  C  D  E  F
    #>    A  3  1  1     1   
    #>    B                  
    #>    C  1        1      
    #>    D                  
    #>    E        1     1  1
    #>    F
    
    print(tbl, na.print = NA)
    #>     to
    #> from    A    B    C    D    E    F
    #>    A    3    1    1 <NA>    1 <NA>
    #>    B <NA> <NA> <NA> <NA> <NA> <NA>
    #>    C    1 <NA> <NA>    1 <NA> <NA>
    #>    D <NA> <NA> <NA> <NA> <NA> <NA>
    #>    E <NA> <NA>    1 <NA>    1    1
    #>    F <NA> <NA> <NA> <NA> <NA> <NA>
    

    Created on 2022-03-28 by the reprex package (v2.0.1)


    Edit

    To add a column sums row at the bottom of the cross table, rbind the result above with colSums. Note that there's no longer need for print(tbl, na.print = NA), the method print (autoprint) being called is now the matrix method.

    library(data.table)
    
    toy_data = data.table(from=c("A","A","A","C","E","E","A","A","A","C","E","E"),
                          to=c("B","C","A","D","F","E","E","A","A","A","C",NA))
    
    levels <- sort(unique(unlist(toy_data)))
    levels <- levels[!is.na(levels)]
    toy_data[, c("from", "to") := lapply(.SD, factor, levels = levels)]
    tbl <- table(toy_data)
    
    class(tbl)  # check the output object class
    #> [1] "table"
    
    tbl <- rbind(tbl, tot = colSums(tbl, na.rm = TRUE))
    is.na(tbl) <- tbl == 0
    
    class(tbl)  # check the output object class, it's no longer "table"
    #> [1] "matrix" "array"
    
    tbl
    #>      A  B  C  D  E  F
    #> A    3  1  1 NA  1 NA
    #> B   NA NA NA NA NA NA
    #> C    1 NA NA  1 NA NA
    #> D   NA NA NA NA NA NA
    #> E   NA NA  1 NA  1  1
    #> F   NA NA NA NA NA NA
    #> tot  4  1  2  1  2  1
    

    Created on 2022-03-29 by the reprex package (v2.0.1)