Search code examples
rggplot2heatmapcontoursignificance

Drawing a contour line around connected cells in a heatmap in R


I have data with two time axes and measurements for each cell. From this I create a heatmap. I also know for each cell whether the measurement is significant.

My problem is to draw a contour line around all cells that are significant. If cells form clusters with the same significance value, I need to draw the contour around the cluster and not around each individual cell.

The data are in this format:

   x_time y_time    metric signif
1       1      1 0.3422285  FALSE
2       2      1 0.6114085  FALSE
3       3      1 0.5381621  FALSE
4       4      1 0.5175120  FALSE
5       1      2 0.6997991  FALSE
6       2      2 0.3054885  FALSE
7       3      2 0.8353888   TRUE
8       4      2 0.3991566   TRUE
9       1      3 0.7522728   TRUE
10      2      3 0.5311418   TRUE
11      3      3 0.4972816   TRUE
12      4      3 0.4330033   TRUE
13      1      4 0.5157972   TRUE
14      2      4 0.6324151   TRUE
15      3      4 0.4734126   TRUE
16      4      4 0.4315119   TRUE

The code below generates this data, where the measurements are random (dt$metrics) and the significance is logical (dt$signif).

# data example
dt <- data.frame(x_time=rep(seq(1, 4), 4), 
                 y_time=rep(seq(1, 4), each=4),
                 metric=(rnorm(16, 0.5, 0.2)),
                 signif=c(rep(FALSE, 6), rep(TRUE, 10)))

The heatmap alone can be generated using ggplot2's geom_tile

# Generate heatmap using ggplot2's geom_tile
library(ggplot2)
p <- ggplot(data = dt, aes(x = x_time, y = y_time))
p <- p + geom_tile(aes(fill = metric))

Based on this question, I managed to draw contours with different colors around each cell according to the significance value.

# Heatmap with lines around each significant cell
p <- ggplot(data = dt, aes(x = x_time, y = y_time))
p <- p + geom_tile(aes(fill = metric, color = signif), size = 2)
p <- p + scale_color_manual(values = c("black", "white"))

This Figure displays the result of this approach.

However, this approach does not group adjacent significant cells together by drawing a contour around the entire group (as is also discussed in the question I linked to).

As this question shows, it is possible to draw boxes around specified areas, but I do not think this can be extended to all possible clusters of cells.


Solution

  • This answer is based on How to get contour lines around the grids in R-raster?.

    library(data.table)
    library(raster)
    

    Also note that clump requires that igraph package is installed, and dissolve = TRUE in rasterToPolygons requires rgeos.

    # convert data.frame to data.table
    # not strictly necessary, but enables use of convenient functions: dcast and rbindlist.
    setDT(d)
    
    # reshape to wide 
    d2 <- dcast(d, y ~ x, value.var = "sig")
    
    # reverse order of rows to match raster order
    # remove first column
    # convert to matrix and then to raster
    r <- raster(as.matrix(d2[ , .SD[.N:1, -1]]),
                xmn = 0, xmx = ncol(d2) - 1, ymn = 0, ymx = ncol(d2) - 1)
    
    # detect clumps of connected cells of the value TRUE
    # convert raster to polygons
    # dissolve polygons into multi-polygons
    polys <- rasterToPolygons(clump(r), dissolve = TRUE)
    
    # grab coordinates of individual polygons and convert to a data.table
    # use idcol = TRUE to enable grouping of paths when plotting
    d_poly <- rbindlist(lapply(polys@polygons,
                               function(x) as.data.table(x@Polygons[[1]]@coords)),
                        idcol = TRUE)
    
    # plot an outline around each 'patch of significant values' using geom_path 
    ggplot(d, aes(x = x, y = y)) +
      geom_tile(aes(fill = z)) +
      geom_path(data = d_poly, aes(x = x + 0.5, y = y + 0.5, group = .id),
                size = 2, color = "red")
    

    enter image description here


    Data:

    d <- structure(list(x = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
                              3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L),
                        y = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L,
                              1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L),
                        sig = c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE,
                                TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE),
                        z = c(0.96, 0.76, 0.14, 0.93, 0.39, 0.06, 0.99, 0.77,
                              0.7, 0.72, 0.08, 0.94, 0.98,  0.83, 0.12, 0.42)),
                   row.names = c(NA, -16L), class = "data.frame")