Search code examples
rfor-looppdfheatmapdna-sequence

Plot DNA nucleotide data using loop and PDF graphics in R


My boss asked me to plot a matrix of DNA nucleotides using the pdf graphics function in R. I have a bit of code I'm working with but I can't figure it out and have spent way too much time trying! I know there are likely other methods/packages out there to visualize this genetic data, and I am absolutely interested in hearing them, but I also need to do this the way it was assigned to me.

I have sequence data in R like so:

> head(b)

Sequence                            X236 X237 X238 X239 X240 X241 X242 X244 X246 X247 X248 X249 X250 X251 X252 X253 X254 X255 X256 X257 X258 X259
1    L19088.1                         G    G    G    G    G    A    G    A    C    C    A    A    G    A    T    G    G    C    C    G    A    A   
2    chr1_43580199_43586187           ·    ·    ·    ·    ·    ·    ·    ·    ·    ·    ·    ·    ·    ·    ·    ·    ·    ·    ·    ·    g    g   

There are a total of 1040 rows and 483 columns, with character possibilities A, a, G, g, T, t, C, c, mid-dot, or X.

I want to color the different characters and plot them in a way that is similar to a heatmap. The dots and the Xs don't need to be colored. The code I am working with so far is:

pdf( 
  sprintf( 
    "%s/L1.pdf", 
    out_dir),
  width = 8.5, height = 11 )
par(omi = rep(0.5,4))
par(mai = rep(0.5,4))
par(bg  = "#eeeeee")
plot( NULL, 
      xlim = c(1,100), ylim = c(1,140), 
      xlab = NA,       ylab = NA, 
      xaxt = "n",      yaxt = "n",
      bty = "n",       asp = 1 )

plot_width <- 100
w <- plot_width / 600


genome_colors <- list()
genome_colors[["A"]] <- "#ea0064"
genome_colors[["a"]] <- "#ea0064"
genome_colors[["C"]] <- "#008a3f"
genome_colors[["c"]] <- "#008a3f"
genome_colors[["G"]] <- "#116eff"
genome_colors[["g"]] <- "#116eff"
genome_colors[["T"]] <- "#cf00dc"
genome_colors[["t"]] <- "#cf00dc"

I <- nrow(b)
J <- ncol(b)
for ( i in 1:I ){
 for ( j in i:J ){
 # plot nucleotide as rectangle with color and text label, something like:

 # plot nucleotides with genome_colors
 # rect( (j-1)*w, top-(i-1)*w, j*w, top-i*w, col = color, border = NA )
 }
 # text( (j+1)*w, top-(i-1)*w, labels = i, cex = 0.05, col = "#dddddd" )
}
dev.off()

If anyone can help me with the plotting loop or point me in a helpful direction I will be so thankful!


Solution

  • assuming df is your dataframe in wide format (one column per position, one row per sequence), example:

    df <- structure(list(sequence = c("L19088.1", "chr1_43580199_43586187"
    ), X236 = c("G", "."), X237 = c("G", "."), X238 = c("A", "a"), 
        X239 = c("T", "C"), X240 = c("A", "c"), X241 = c("G", "G"
        )), class = "data.frame", row.names = 1:2)
    
    ## > df
    ##                 sequence X236 X237 X238 X239 X240 X241
    ## 1               L19088.1    G    G    A    T    A    G
    ## 2 chr1_43580199_43586187    .    .    a    C    c    G
    

    ... you can use packages ggplot2 and tidyr from the tidyverse like so:

    library(tidyr)
    library(ggplot2)
    
    df %>%
      ## reshape to long table
      ## (one column each for sequence, position and nucleotide):
      pivot_longer(-sequence, ## stack all columns *except* sequence
                   names_to = 'position',
                   values_to = 'nucleotide'
                   ) %>%
      ## create the plot:
      ggplot() +
      geom_tile(aes(x = position, y = sequence, fill = nucleotide),
                  height = .9 ## adjust to visually separate sequences
                ) +
      scale_fill_manual(values = c('A'='#ea0064', 'a'='#ea0064', 'C'='#008a3f',
                                  'c'='#008a3f', 'G'='#116eff', 'g'='#116eff',
                                  'T'='#cf00dc', 't'='#cf00dc', '.'='#a0a0a0'
                                  )
                        ) +
      labs(x = 'x-axis-title', y='y-axis-title') +
      ## remove x-axis (=position) elements: they'll probably be too dense:
      theme(axis.title.x = element_blank(),
            axis.text.x = element_blank(),
            axis.ticks.x = element_blank()
            )
    

    ^^^ for easy styling see e.g. ggplot themes

    save plot with convenience wrapper ggsave:

      ggsave(filename = 'my_plot.pdf',
           width = 12, ## inches; to fill DIN A4 landscape
           height = 8
           )
    
    
    
    

    When using the pdf() function, don't forget to explicitly print your plot:

    pdf(file = 'my_plot.pdf',
        ## ... other parameters
    )
    print( ## you need to print the plot
      qplot(data = cars, x = speed, y = dist, geom = 'point')
    )
    dev.off()