
R: Sparse? Transforming data for co-occurrence matrix


I'm a Bio major using R to generate some visualizations showing which human proteins (uniprots) are targeted by different bacterial strains.

# sample data
human.uniprots <- c("P15311", "P0CG48", "Q8WYH8", "P42224", "Q9NXR8",
                    "P40763", "P05067", "P60709", "Q9UDW1", "Q9H160",
                    "Q9UKL0", "P26038", "P61244", "O95817", "Q09472",
                    "P15311","P05067", "P60709", "Q9UDW1", "Q9H160")
strains <- rep(c("A", "B", "C", "C"), each = 5)
final <- cbind(human.uniprots, strains)

I'm trying to generate a co-occurrence matrix/heat map...something like

h.map <- data.frame(matrix(nrow = length(unique(human.uniprots)),
                           ncol = length(unique(strains)) + 1))
h.map.cols <- c("human_uniprots", "A", "B", "C")
colnames(h.map) <- h.map.cols

...where the columns have the strains, the rows have the proteins, and the data frame cells are populated with the counts of times that a protein interacts with a strain. So if strain A, B, and C all interact with a uniprot, they should all have a value of 3 in their cells for that uniprot row.

I've tried making a list of tuples of unique strain and human_uniprots, then searching for that tuple that matches the strain and human uniprot pair from the matrix I want to populate, and adding a "1" if there's a match...but I'm not sure how to work with tuples in R. Then I saw this: Populating a co-occurrence matrix

Which is what I want, but I'm not understanding the usage or syntax...is sparse() even a function in R?

Additionally...it would be nice to rank all the proteins by ones which interact with all strains. So all the proteins that interact with all the strains should be at the top, followed by ones that interact with 2 strains, and then 1 strain...


Solution

  • sparse() looks like a MATLAB function, not an R one. What you're describing is a bipartite network (proteins on one side, strains on the other), which can be represented by an incidence matrix.

    human.uniprots <- c("P15311", "P0CG48", "Q8WYH8", "P42224", "Q9NXR8",
                        "P40763", "P05067", "P60709", "Q9UDW1", "Q9H160",
                        "Q9UKL0", "P26038", "P61244", "O95817", "Q09472",
                        "P15311","P05067", "P60709", "Q9UDW1", "Q9H160")
    strains <- rep(c("A", "B", "C", "D"), each = 5) # fourth group labelled "D" here; the question repeats "C"
    final <- cbind(human.uniprots, strains)
    
    final_df <- as.data.frame(final)
    
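    library(igraph) # install.packages("igraph")
    # Build an undirected graph with one edge per protein-strain pair
    g <- graph_from_data_frame(final_df, directed = FALSE)
    # Mark the two vertex sets: strains are FALSE (rows), proteins are TRUE (columns)
    V(g)$type <- ifelse(V(g)$name %in% strains, FALSE, TRUE)
    
    # Newer igraph releases rename this function to as_biadjacency_matrix()
    as_incidence_matrix(g)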
    #>   P15311 P0CG48 Q8WYH8 P42224 Q9NXR8 P40763 P05067 P60709 Q9UDW1 Q9H160
    #> A      1      1      1      1      1      0      0      0      0      0
    #> B      0      0      0      0      0      1      1      1      1      1
    #> C      0      0      0      0      0      0      0      0      0      0
    #> D      1      0      0      0      0      0      1      1      1      1
    #>   Q9UKL0 P26038 P61244 O95817 Q09472
    #> A      0      0      0      0      0
    #> B      0      0      0      0      0
    #> C      1      1      1      1      1
    #> D      0      0      0      0      0
    

    Or, to get proteins as rows and strains as columns, swap TRUE and FALSE:

    V(g)$type <- ifelse(V(g)$name %in% strains, TRUE, FALSE)
    
    as_incidence_matrix(g)
    #>        A B C D
    #> P15311 1 0 0 1
    #> P0CG48 1 0 0 0
    #> Q8WYH8 1 0 0 0
    #> P42224 1 0 0 0
    #> Q9NXR8 1 0 0 0
    #> P40763 0 1 0 0
    #> P05067 0 1 0 1
    #> P60709 0 1 0 1
    #> Q9UDW1 0 1 0 1
    #> Q9H160 0 1 0 1
    #> Q9UKL0 0 0 1 0
    #> P26038 0 0 1 0
    #> P61244 0 0 1 0
    #> O95817 0 0 1 0
    #> Q09472 0 0 1 0
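
For the counting and ranking part of the question, here is a minimal base-R sketch that is not part of the original reprex. It assumes the same `human.uniprots` and `strains` vectors used in this answer; the names `counts`, `n_strains`, and `ranked` are only illustrative. `table()` cross-tabulates the pairs, and `rowSums()` on the presence/absence version gives the number of strains per protein.

```r
human.uniprots <- c("P15311", "P0CG48", "Q8WYH8", "P42224", "Q9NXR8",
                    "P40763", "P05067", "P60709", "Q9UDW1", "Q9H160",
                    "Q9UKL0", "P26038", "P61244", "O95817", "Q09472",
                    "P15311", "P05067", "P60709", "Q9UDW1", "Q9H160")
strains <- rep(c("A", "B", "C", "D"), each = 5)

# Cross-tabulate: rows are proteins, columns are strains,
# cells count how often each protein-strain pair appears
counts <- table(protein = human.uniprots, strain = strains)

# Number of distinct strains that target each protein
n_strains <- rowSums(counts > 0)

# Reorder rows so proteins targeted by the most strains come first
ranked <- counts[order(-n_strains), ]
head(ranked)
```

If a heat map is the goal, something like `heatmap(unclass(ranked), Rowv = NA, Colv = NA, scale = "none")` should plot the ranked table without reordering it, though a dedicated package such as pheatmap may give nicer output.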
    

    Created on 2018-05-25 by the reprex package (v0.2.0).
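
On the `sparse()` question itself: as far as I know, base R has no function of that name, but the Matrix package provides `sparseMatrix()`, which plays a similar role. The sketch below is an addition to the original answer and assumes the same sample vectors; `protein_f`, `strain_f`, and `m` are illustrative names. Each protein and strain is turned into an integer index through a factor, and those indices become the row and column positions of the non-zero entries.

```r
library(Matrix) # recommended package, normally installed alongside R

human.uniprots <- c("P15311", "P0CG48", "Q8WYH8", "P42224", "Q9NXR8",
                    "P40763", "P05067", "P60709", "Q9UDW1", "Q9H160",
                    "Q9UKL0", "P26038", "P61244", "O95817", "Q09472",
                    "P15311", "P05067", "P60709", "Q9UDW1", "Q9H160")
strains <- rep(c("A", "B", "C", "D"), each = 5)

# Factors give every protein and strain a stable integer code
protein_f <- factor(human.uniprots)
strain_f  <- factor(strains)

# Sparse protein-by-strain matrix; repeated (i, j) pairs are summed,
# so each cell holds the number of times that pair occurs
m <- sparseMatrix(
  i = as.integer(protein_f),
  j = as.integer(strain_f),
  x = rep(1, length(human.uniprots)),
  dims = c(nlevels(protein_f), nlevels(strain_f)),
  dimnames = list(levels(protein_f), levels(strain_f))
)
m
```

For 15 proteins and 4 strains a dense table is perfectly fine; the sparse form only starts to pay off when the matrix is large and mostly zeros.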