
How to link a dissimilarity matrix to metadata so I can plot an MDS with colored points in R?


I have a data matrix as a .csv (output from sourmash). The matrix looks something like this: [image: matrix]

I also have metadata that corresponds with that matrix. It groups the samples represented in the matrix several different ways. It looks something like this: [image: metadata]

I'd like to plot an MDS while coloring certain points based on their metadata value. So far I've been able to import the matrix and plot the points, but am lost on how to "link" the metadata values to the matrix so that I can color the points by metadata group when they are plotted. I know it's probably a simple fix but would appreciate any help! This is what I have so far:

#import matrix and metadata
sm_matrix <- read.csv("path to .csv", header = TRUE, sep = ",")
md <- read.csv("path to .csv", header = TRUE, sep = ",")

#transform for plotting
sm_matrix <- as.matrix(sm_matrix)

#plot
mds <- sm_matrix %>%
  dist() %>%
  cmdscale() %>%
  as_tibble()
colnames(mds) <- c("dim.1", "dim.2")

I've also tried this to plot:

ggscatter(mds, x = "dim.1", y = "dim.2",
          color = md$Location,
          palette = "jco",
          size = 1, 
          ellipse = TRUE,
          ellipse.type = "convex",
          repel = TRUE)

but I get this error:

Error in `check_aesthetics()`:
! Aesthetics must be either length 1 or the same as the data (92): colour
Run `rlang::last_error()` to see where the error occurred.
Warning message:
In if (color %in% names(data) & is.null(add.params$color)) add.params$color <- color :
  the condition has length > 1 and only the first element will be used

Thank you!

Sam


Solution

  • Here is an approach that works. ggscatter still emits a warning, but a warning is not an error, and it may be an issue in the package itself.

    First, the data are created directly in the script. This is the preferred way, because otherwise readers have to retype the data from the screenshots. It is also good style to load the packages used explicitly.

    The script itself uses two tricks. First, the column names are set with setNames right after calling as_tibble. Second, the character variable Location is converted to a numeric vector by first converting it to a factor and then to a numeric. I also increased size to 4 to make the result easier to see.
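To see what the factor-to-numeric conversion described above produces, here is a tiny base R sketch. The `location` vector is made up for illustration and is not the asker's data.

```r
# Hypothetical Location values, one per sample
location <- c("D", "E", "F", "D")

# as.factor() sorts the distinct values into levels ("D", "E", "F");
# as.numeric() then returns the level index of each element
codes <- as.numeric(as.factor(location))

print(levels(as.factor(location)))
print(codes)  # 1 2 3 1
```

The result has one value per sample, which is the length the plotting call expects. The codes are arbitrary integers, so the original group labels are not carried along with them.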

    library("dplyr")
    library("ggpubr")
    
    # Toy similarity matrix (symmetric, 1 on the diagonal)
    sm_matrix <- matrix(c(1, 0.2, 0.7, 0.2, 1, 0.2, 0.7, 0.2, 1), nrow = 3)
    rownames(sm_matrix) <- colnames(sm_matrix) <- c("sample1", "sample2", "sample3")
    
    # Toy metadata; SampleID matches the row names of sm_matrix
    md <- as.data.frame(matrix(c("sample1", "sample2", "sample3", LETTERS[1:9]), nrow = 3))
    colnames(md) <- c("SampleID", "Diet", "Location", "Size")
    
    mds <- sm_matrix %>%
      dist() %>%
      cmdscale() %>%
      as_tibble() %>%
      setNames(c("dim.1", "dim.2"))
    
    plot(mds)
    
    ggscatter(mds, x = "dim.1", y = "dim.2",
              color = as.numeric(as.factor(md$Location)),
              palette = "jco",
              size = 4, 
              ellipse = TRUE,
              ellipse.type = "convex",
              repel = TRUE)
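
The script above relies on the metadata rows being in the same order as the matrix rows. As an alternative, the following is a minimal base R sketch that links the two tables by sample ID instead, so the Location labels travel with the coordinates. The toy data, and the assumption that the matrix row names equal the SampleID values, are illustrative and would need checking against the real sourmash output.

```r
# Toy matrix; values are made up for illustration
sm_matrix <- matrix(c(1, 0.2, 0.7,
                      0.2, 1, 0.2,
                      0.7, 0.2, 1), nrow = 3)
rownames(sm_matrix) <- colnames(sm_matrix) <- c("sample1", "sample2", "sample3")

# Toy metadata, deliberately in a different row order than the matrix
md <- data.frame(
  SampleID = c("sample3", "sample1", "sample2"),
  Location = c("F", "D", "E"),
  stringsAsFactors = FALSE
)

# Classical MDS; the result keeps the matrix row names as its row names
coords <- cmdscale(dist(sm_matrix), k = 2)

# Put the coordinates and the sample IDs into one data frame
mds <- data.frame(
  SampleID = rownames(coords),
  dim.1 = coords[, 1],
  dim.2 = coords[, 2],
  stringsAsFactors = FALSE
)

# Link metadata by sample ID rather than by row position
idx <- match(mds$SampleID, md$SampleID)
stopifnot(!anyNA(idx))  # every sample must have a metadata row
mds$Location <- factor(md$Location[idx])

print(mds)
```

With Location stored as a column of `mds`, the column name can be passed to the plotting call, for example `ggscatter(mds, x = "dim.1", y = "dim.2", color = "Location", palette = "jco")`, which should also give a legend with the group labels. If the sourmash matrix already holds similarities, `as.dist(1 - sm_matrix)` may be a more appropriate input to `cmdscale()` than `dist(sm_matrix)`; that depends on how the matrix was produced.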