r ggplot2 geom-text

Labeling specific points on volcano plot


I am having trouble showing the points of interest on my ggplot. When I select my genes in geom_label_repel(), only 3 labels appear, even though I know there are 57 genes in the list.

Fig. 1. Here is what I get when running geom_label_repel.

Fig. 2. Here is what I get with geom_text, but I can't read the text easily over some of the blue points, so I am trying to put a box around the labels.

Fig. 3. I want figure 2 to look like this, but showing the genes from figure 2 with a box around each label to make reading easier.

I've tried changing geom_label_repel to geom_text, but the box drawn by geom_label_repel makes the text easier to read. I have also isolated just the points of interest and graphed those, which resulted in all my points being shown, so I know the issue is somewhere in the geom_label_repel line.

  results_df$Gene <- rownames(resultsLFC_df)
  
  #create named character vector for color palette
  palette <- c("Upregulated" = "red",
               "Downregulated" = "blue",
               "Not_significant" = "gray")

  # show volcano plot
  results_df %>%
    mutate(Expression = if_else(padj < custom_alpha & log2FoldChange > 0, "Upregulated",
                               if_else(padj < custom_alpha & log2FoldChange < 0, "Downregulated", "Not_significant"))) %>%
    ggplot(aes(x = log2FoldChange, y = -log10(padj), color = Expression)) +
    geom_point(alpha = 0.8, size = 0.5) +
    geom_vline(xintercept = 0, linetype = "dashed") +
    geom_hline(yintercept = -log10(custom_alpha), linetype = "dashed") +
    

    geom_text(
      aes(label = ifelse(Gene %in% genes_to_label, as.character(Gene), "")), color = "black",
      arrow = arrow(length = unit(0.02, "npc")),
      box.padding=.5, point.padding=0.5, segment.color="black", show.legend=FALSE, max.overlaps = 10,
      hjust=0,vjust=0) +

    
    # geom_label_repel(
    #   aes(label = if_else(Gene %in% genes_to_label, Gene, "")),
    #   arrow = arrow(length = unit(0.02, "npc")),
    #   box.padding=.1, point.padding=0.5, segment.color="gray70", show.legend=FALSE, max.overlaps = 20
    # ) +
    
    labs(title = condition_contrast, x = "log2(Fold Change)", y = "-log10(padj)") +
    scale_color_manual(values = palette, limits = names(palette))+
    theme_classic()

I have shown both my geom_text and geom_label_repel calls, as I've been trying to work through them.


Solution

  • Here are two options, the first using geom_text and the second using ggrepel::geom_label_repel.

    With geom_text, changing the horizontal justification of the text helps a lot with the overplotting (I think).

    With geom_label_repel you can use the nudge_x and nudge_y arguments and raise max.overlaps (the default is 10; the example below uses Inf). Your labels were probably missing because max.overlaps was too low: ggrepel leaves out any label that overlaps too many other labels or data points. In my experience this works fine for reporting volcano plots, even though the labels don't sit right next to the points they describe; otherwise there would be a ton of overplotting.

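    library(dplyr)    # provides if_else()
    library(ggplot2)  # ggrepel is called below via ggrepel:: and must be installed

    set.seed(123)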
    df <- data.frame(
      padj = runif(1000),
      log2FoldChange = rnorm(1000),
      Gene = paste0("G", 1:1000)
    )
    palette <- c("Upregulated" = "red",
                 "Downregulated" = "blue",
                 "Not_significant" = "gray")
    custom_alpha <- 0.05
    df$Expression <- if_else(df$padj < custom_alpha & df$log2FoldChange > 0, "Upregulated",
                             if_else(df$padj < custom_alpha & df$log2FoldChange < 0,
                                     "Downregulated", "Not_significant"))
    genes_to_label <- df[df$Expression != "Not_significant", "Gene"]
    
    ggplot(df, aes(x = log2FoldChange, y = -log10(padj), color = Expression)) +
      geom_point(alpha = 0.8, size = 0.5) +
      geom_vline(xintercept = 0, linetype = "dashed") +
      geom_hline(yintercept = -log10(custom_alpha), linetype = "dashed") +
      geom_text(
        data = df[df$Gene %in% genes_to_label,],
        aes(label = Gene), color = "black",
        show.legend=FALSE,
        # varying hjust by row might make this easier to read by biasing the downregulated
        # labels leftward and the upregulated labels rightward.
        hjust = ifelse(df[df$Gene %in% genes_to_label, "Expression"] == "Upregulated", 0, 1),
        size = 3, vjust = -0.5) +
      labs(title = "geom text", x = "log2(Fold Change)", y = "-log10(padj)") +
      scale_color_manual(values = palette, limits = names(palette))+
      theme_classic()
    
    
    ggplot(df, aes(x = log2FoldChange, y = -log10(padj), color = Expression)) +
      geom_point(alpha = 0.8, size = 0.5) +
      geom_vline(xintercept = 0, linetype = "dashed") +
      geom_hline(yintercept = -log10(custom_alpha), linetype = "dashed") +
      # here the geom_label_repel call allows unlimited overlaps (max.overlaps = Inf) and nudges
      # the labels based on the gene regulation (by 3 or -3). You could also just call
      # geom_label_repel twice, once per direction.
      ggrepel::geom_label_repel(
        data = df[df$Gene %in% genes_to_label,],
        aes(label = Gene),
        arrow = arrow(length = unit(0.02, "npc")),
        nudge_x = 3 * ifelse(df[df$Gene %in% genes_to_label, "Expression"] == "Upregulated", 1, -1),
        nudge_y = 1,
        box.padding=.1, point.padding=0.5, segment.color="gray70", show.legend=FALSE, max.overlaps = Inf
      ) +
      labs(title = "condition_contrast", x = "log2(Fold Change)", y = "-log10(padj)") +
      scale_color_manual(values = palette, limits = names(palette))+
      theme_classic()
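
The comment in the second plot mentions calling geom_label_repel twice instead of passing a vector to nudge_x. A minimal sketch of that variant follows, on the same kind of simulated data. The `up` and `down` data frames are names introduced here for illustration, and the nudge values are the same arbitrary 3 and -3 used above.

```r
library(ggplot2)
library(ggrepel)

# Simulated data, same shape as the example above
set.seed(123)
df <- data.frame(
  padj = runif(1000),
  log2FoldChange = rnorm(1000),
  Gene = paste0("G", 1:1000)
)
palette <- c("Upregulated" = "red",
             "Downregulated" = "blue",
             "Not_significant" = "gray")
custom_alpha <- 0.05
df$Expression <- ifelse(df$padj < custom_alpha & df$log2FoldChange > 0, "Upregulated",
                        ifelse(df$padj < custom_alpha & df$log2FoldChange < 0,
                               "Downregulated", "Not_significant"))

# One label data set per direction (hypothetical helper names)
up   <- df[df$Expression == "Upregulated", ]
down <- df[df$Expression == "Downregulated", ]

p <- ggplot(df, aes(x = log2FoldChange, y = -log10(padj), color = Expression)) +
  geom_point(alpha = 0.8, size = 0.5) +
  # upregulated labels are pushed to the right
  geom_label_repel(data = up, aes(label = Gene),
                   nudge_x = 3, nudge_y = 1, max.overlaps = Inf,
                   segment.color = "gray70", show.legend = FALSE) +
  # downregulated labels are pushed to the left
  geom_label_repel(data = down, aes(label = Gene),
                   nudge_x = -3, nudge_y = 1, max.overlaps = Inf,
                   segment.color = "gray70", show.legend = FALSE) +
  scale_color_manual(values = palette, limits = names(palette)) +
  theme_classic()

print(p)
```

Each call gets a scalar nudge_x, which is easier to read than the row-wise ifelse. The trade-off is that the two layers are repelled independently, so labels from one layer may overlap labels from the other near the middle of the plot.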
    

    [Figures: geom_text version; geom_label_repel version]
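
If you would rather keep the original pattern from the question (all points in the layer, with an empty string for genes that should not be labeled), the smallest change is probably just raising max.overlaps. The sketch below assumes simulated data and a hypothetical `Label` column. In ggrepel, an empty-string label is hidden, but its point still repels the visible labels, which is likely why a dense volcano plot hits the overlap limit so quickly.

```r
library(ggplot2)
library(ggrepel)

# Simulated data; Label is a hypothetical helper column
set.seed(123)
df <- data.frame(
  padj = runif(1000),
  log2FoldChange = rnorm(1000),
  Gene = paste0("G", 1:1000),
  stringsAsFactors = FALSE
)
custom_alpha <- 0.05
genes_to_label <- df$Gene[df$padj < custom_alpha]

# "" hides the label; the point itself still pushes other labels away
df$Label <- ifelse(df$Gene %in% genes_to_label, df$Gene, "")

p <- ggplot(df, aes(x = log2FoldChange, y = -log10(padj))) +
  geom_point(alpha = 0.8, size = 0.5, color = "gray") +
  geom_label_repel(
    aes(label = Label),
    max.overlaps = Inf,       # never drop a label for overlapping too much
    min.segment.length = 0,   # always draw the connecting segment
    box.padding = 0.1, point.padding = 0.5,
    segment.color = "gray70", show.legend = FALSE
  ) +
  theme_classic()

print(p)
```

With max.overlaps = Inf every requested label is drawn, at the cost of labels drifting further from their points when the plot is crowded. Passing only the labeled rows via data =, as in the answer above, tends to be faster because ggrepel has fewer points to repel against.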