Parsing venn table to create Venn Diagram in R


I have tables with values for Venn Diagrams which I am trying to read into R and parse in order to plot with the VennDiagram package. My tables look like this:

H3K27AC.bed  H3K4ME3.bed  gencode.bed  Total  Name
                          X            19184  gencode.bed
             X                         6843   H3K4ME3.bed
             X            X            3942   H3K4ME3.bed|gencode.bed
X                                      5097   H3K27AC.bed
X                         X            1262   H3K27AC.bed|gencode.bed
X            X                         4208   H3K27AC.bed|H3K4ME3.bed
X            X            X            9222   H3K27AC.bed|H3K4ME3.bed|gencode.bed

I can read the table in as a dataframe like this:

> venn_table_df<-read.table(venn_table_file,header = TRUE,sep = "\t",stringsAsFactors = FALSE)
> venn_table_df
  H3K27AC.bed H3K4ME3.bed gencode.bed Total                                Name
1                                   X 19184                         gencode.bed
2                       X              6843                         H3K4ME3.bed
3                       X           X  3942             H3K4ME3.bed|gencode.bed
4           X                          5097                         H3K27AC.bed
5           X                       X  1262             H3K27AC.bed|gencode.bed
6           X           X              4208             H3K27AC.bed|H3K4ME3.bed
7           X           X           X  9222 H3K27AC.bed|H3K4ME3.bed|gencode.bed

I can get the categories for the Venn diagram from the table like this:

> venn_categories<-colnames(venn_table_df)[!colnames(venn_table_df) %in% c("Total","Name")] 
> venn_categories
[1] "H3K27AC.bed" "H3K4ME3.bed" "gencode.bed"

And I can even make a summary table that is a bit easier to read, like this:

> venn_summary<-venn_table_df[!colnames(venn_table_df) %in% venn_categories]
> venn_summary
  Total                                Name
1 19184                         gencode.bed
2  6843                         H3K4ME3.bed
3  3942             H3K4ME3.bed|gencode.bed
4  5097                         H3K27AC.bed
5  1262             H3K27AC.bed|gencode.bed
6  4208             H3K27AC.bed|H3K4ME3.bed
7  9222 H3K27AC.bed|H3K4ME3.bed|gencode.bed

What stumps me is how to pull the values out of the table and assign them to the correct areas of the Venn diagram. For reference, a manual call to draw.triple.venn looks like this (the Total values in the table are exclusive counts, so each argument is a sum of several regions):

n1<-5097
n2<-6843
n3<-19184

n12<-4208
n13<-1262
n23<-3942

n123<-9222

venn <-draw.triple.venn(area1=n1+n12+n13+n123,
                        area2=n2+n23+n12+n123,
                        area3=n3+n23+n13+n123,
                        n12=n12+n123,
                        n13=n13+n123,
                        n23=n23+n123,
                        n123=n123,
                        category=venn_categories,
                        fill=c('red','blue','green'),
                        alpha=c(rep(0.3,3)))
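
As a sanity check on the arithmetic, here is a minimal sketch (not part of the original workflow) that computes the inclusive totals passed above and confirms, by inclusion-exclusion, that they account for every row of the table. The names area1, pair12 and so on are illustrative only.

```r
# Exclusive counts copied from the Total column: each value counts peaks
# found in exactly that combination of files.
n1 <- 5097; n2 <- 6843; n3 <- 19184
n12 <- 4208; n13 <- 1262; n23 <- 3942
n123 <- 9222

# Inclusive totals: add up every region that lies inside the circle/overlap.
area1  <- n1 + n12 + n13 + n123   # 19789
area2  <- n2 + n12 + n23 + n123   # 24215
area3  <- n3 + n13 + n23 + n123   # 33610
pair12 <- n12 + n123              # 13430
pair13 <- n13 + n123              # 10484
pair23 <- n23 + n123              # 13164

# Inclusion-exclusion should give back the sum of all seven table rows.
union_size <- area1 + area2 + area3 - pair12 - pair13 - pair23 + n123
union_size == n1 + n2 + n3 + n12 + n13 + n23 + n123   # TRUE
```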

Obviously this requires setting the values by hand, which is not practical since I have many of these data sets and also need to scale up to 4-way and 5-way Venn diagrams. How can I get R to find the correct value for each area of the Venn diagram? I have tried several approaches using grep, grepl, and subsetting the data frame for the rows that match the categories for each area of the plot, but none has worked correctly. Any suggestions? For context, this data is output from the mergePeaks program in the HOMER software package.


Solution

  • I think I figured it out: use regular expressions to find the table rows that belong to each area of the plot. Here is the full workflow:

    # load packages
    library('VennDiagram')
    library('gridExtra')
    
    # read in the venn text
    venn_table_df<-read.table(venn_table_file,header = TRUE,sep = "\t",stringsAsFactors = FALSE)
    venn_table_df
    

    The data frame looks like this:

    > venn_table_df
      H3K27AC.bed H3K4ME3.bed gencode.bed Total                                Name
    1                                   X 19184                         gencode.bed
    2                       X              6843                         H3K4ME3.bed
    3                       X           X  3942             H3K4ME3.bed|gencode.bed
    4           X                          5097                         H3K27AC.bed
    5           X                       X  1262             H3K27AC.bed|gencode.bed
    6           X           X              4208             H3K27AC.bed|H3K4ME3.bed
    7           X           X           X  9222 H3K27AC.bed|H3K4ME3.bed|gencode.bed
    
    > # the data frame can be recreated from this dput() output
    > dput(venn_table_df)
    structure(list(H3K27AC.bed = c("", "", "", "X", "X", "X", "X"
    ), H3K4ME3.bed = c("", "X", "X", "", "", "X", "X"), gencode.bed = c("X", 
    "", "X", "", "X", "", "X"), Total = c(19184L, 6843L, 3942L, 5097L, 
    1262L, 4208L, 9222L), Name = c("gencode.bed", "H3K4ME3.bed", 
    "H3K4ME3.bed|gencode.bed", "H3K27AC.bed", "H3K27AC.bed|gencode.bed", 
    "H3K27AC.bed|H3K4ME3.bed", "H3K27AC.bed|H3K4ME3.bed|gencode.bed"
    )), .Names = c("H3K27AC.bed", "H3K4ME3.bed", "gencode.bed", "Total", 
    "Name"), class = "data.frame", row.names = c(NA, -7L))
    

    Then parse the table:

    # get the venn categories
    venn_categories<-colnames(venn_table_df)[!colnames(venn_table_df) %in% c("Total","Name")] 
    
    
    # make a summary table
    venn_summary<-venn_table_df[!colnames(venn_table_df) %in% venn_categories]
    venn_summary
    
    # get the areas for the venn; add up all the overlaps that contain the given category 
    
    # area1
    area_n1<-sum(venn_summary[grep(pattern = paste0("(?=.*",venn_categories[1],")"),x = venn_summary$Name,perl = TRUE),][["Total"]])
    
    # area2
    area_n2<-sum(venn_summary[grep(pattern = paste0("(?=.*",venn_categories[2],")"),x = venn_summary$Name,perl = TRUE),][["Total"]])
    
    # area3
    area_n3<-sum(venn_summary[grep(pattern = paste0("(?=.*",venn_categories[3],")"),x = venn_summary$Name,perl = TRUE),][["Total"]])
    
    # n12
    area_n12<-sum(venn_summary[grep(pattern = paste0("(?=.*",venn_categories[1],")","(?=.*",venn_categories[2],")"),x = venn_summary$Name,perl = TRUE),][["Total"]])
    
    # n13
    area_n13<-sum(venn_summary[grep(pattern = paste0("(?=.*",venn_categories[1],")","(?=.*",venn_categories[3],")"),x = venn_summary$Name,perl = TRUE),][["Total"]])
    
    # n23
    area_n23<-sum(venn_summary[grep(pattern = paste0("(?=.*",venn_categories[2],")","(?=.*",venn_categories[3],")"),x = venn_summary$Name,perl = TRUE),][["Total"]])
    
    
    # n123
    area_n123<-sum(venn_summary[grep(pattern = paste0("(?=.*",venn_categories[1],")","(?=.*",venn_categories[2],")","(?=.*",venn_categories[3],")"),x = venn_summary$Name,perl = TRUE),][["Total"]])
    
    
    venn <-draw.triple.venn(area1=area_n1,
                            area2=area_n2,
                            area3=area_n3,
                            n12=area_n12,
                            n13=area_n13,
                            n23=area_n23,
                            n123=area_n123,
                            category=venn_categories,
                            fill=c('red','blue','green'),
                            alpha=c(rep(0.3,3)))
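
The seven near-identical grep calls above could probably be folded into one helper. The sketch below keeps the same lookahead technique; overlap_total() is a hypothetical name, not a VennDiagram function. It also wraps each category in \Q...\E, on the assumption that the "." in the file names should be matched literally rather than as a regex wildcard.

```r
# Summary table from above, rebuilt here so the sketch runs on its own.
venn_summary <- data.frame(
  Total = c(19184L, 6843L, 3942L, 5097L, 1262L, 4208L, 9222L),
  Name = c("gencode.bed", "H3K4ME3.bed", "H3K4ME3.bed|gencode.bed",
           "H3K27AC.bed", "H3K27AC.bed|gencode.bed",
           "H3K27AC.bed|H3K4ME3.bed",
           "H3K27AC.bed|H3K4ME3.bed|gencode.bed"),
  stringsAsFactors = FALSE
)
venn_categories <- c("H3K27AC.bed", "H3K4ME3.bed", "gencode.bed")

# Hypothetical helper: sum Total over every row whose Name contains all of
# the categories in `cats`.
overlap_total <- function(summary_df, cats) {
  # One lookahead per category, concatenated; \Q...\E quotes the name in PCRE.
  pattern <- paste0("(?=.*\\Q", cats, "\\E)", collapse = "")
  hits <- grepl(pattern, summary_df$Name, perl = TRUE)
  sum(summary_df$Total[hits])
}

area_n1   <- overlap_total(venn_summary, venn_categories[1])        # 19789
area_n12  <- overlap_total(venn_summary, venn_categories[c(1, 2)])  # 13430
area_n123 <- overlap_total(venn_summary, venn_categories)           # 9222
```

This still matches substrings, so it assumes no category name is contained in another one.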
    

    The key was to use regular expressions to select only the table entries whose Name includes every category for a given Venn area. This is more involved than I was hoping for, and it will need manual setup to adapt to four-way and five-way Venn diagrams, but it works so far. I am open to other suggestions that simplify the process and scale up more easily.
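
One possible way to avoid the per-area setup, sketched under assumptions: skip the regular expressions, read the "X" marker columns directly, and loop over every combination of categories with combn(). venn_overlaps() is a hypothetical helper name. The labels it produces follow the area1 / n12 / n123 argument naming of draw.triple.venn used above; whether draw.quad.venn and draw.quintuple.venn accept the same pattern is an assumption that should be checked against their help pages.

```r
# Same table as the dput() output above, rebuilt so the sketch runs alone.
venn_table_df <- data.frame(
  H3K27AC.bed = c("", "", "", "X", "X", "X", "X"),
  H3K4ME3.bed = c("", "X", "X", "", "", "X", "X"),
  gencode.bed = c("X", "", "X", "", "X", "", "X"),
  Total = c(19184L, 6843L, 3942L, 5097L, 1262L, 4208L, 9222L),
  stringsAsFactors = FALSE
)
venn_categories <- setdiff(colnames(venn_table_df), c("Total", "Name"))

# Logical matrix: TRUE where a row's region lies inside a category.
in_set <- venn_table_df[venn_categories] == "X"

# Hypothetical helper: inclusive total for every combination of categories.
venn_overlaps <- function(in_set, totals) {
  n <- ncol(in_set)
  out <- numeric(0)
  for (k in seq_len(n)) {
    for (idx in combn(n, k, simplify = FALSE)) {
      # keep rows that are marked in all k chosen categories
      rows <- rowSums(in_set[, idx, drop = FALSE]) == k
      label <- if (k == 1) paste0("area", idx)
               else paste0("n", paste(idx, collapse = ""))
      out[label] <- sum(totals[rows])
    }
  }
  out
}

overlaps <- venn_overlaps(in_set, venn_table_df$Total)
overlaps

# Possible use with VennDiagram loaded (not run here):
# do.call(draw.triple.venn,
#         c(as.list(overlaps), list(category = venn_categories)))
```

Because the number of categories comes from the table itself, the same helper would produce the 15 values for a four-way table or the 31 values for a five-way table without further edits.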