
R: Grouped boxplot with 2 X-variables, in each group compare all samples vs. one X2 group

I am trying to generate a grouped boxplot in ggplot2 with two x variables. This is straightforward with

ggplot(boxplot_classes, aes(x=Group, y=Value, fill=Mutation)) + geom_boxplot()

However, I do not want to compare the two subgroups defined by the second x variable with each other. Instead, for each group defined by the first x variable, I need to compare all samples in that group with one single subgroup from the second x variable.

Here is an example. The data looks like this:

Value   Mutation    Group
32.00   Yes 1
5.00    no  1
18.00   no  1
3.00    no  1
16.00   no  1
14.00   Yes 1
28.00   Yes 1
28.00   Yes 1
49.00   Yes 1
15.00   Yes 1
43.00   no  2
49.00   Yes 2
40.00   Yes 2
17.00   Yes 2
9.00    no  2
31.00   Yes 2
8.00    Yes 2
43.00   no  2
50.00   Yes 2
48.00   Yes 2
11.00   Yes 3
42.00   no  3
0.00    Yes 3
15.00   Yes 3
8.00    no  3
1.00    Yes 3
41.00   no  3
15.00   no  3
4.00    no  3
31.00   Yes 3

I would like to generate a figure where, in each "Group" (in the example above: 1, 2, 3), two boxplots are drawn: one for all samples in this "Group" and one only for those samples in this group that also have Mutation == "Yes". In the real data, many more "Groups" are present.

I hope I have explained my problem well. Unfortunately, I cannot work out the correct syntax or how the data has to be rearranged.

Thank you very much for any help!

EDIT: I uploaded an example of the figure I am trying to generate at


  • If we play with your data a bit, we can do it. Suppose your data is in dat:

    dat_yes <- dat[dat$Mutation == 'Yes',] #subset only Yes
    dat_yes$Mutation_2 <- 'Yes' #add column
    dat$Mutation_2 <- 'All' #add column
    dat_full <- rbind(dat, dat_yes) #put together
    ggplot(dat_full, aes(x = factor(Group), y = Value))+
        geom_boxplot(aes(fill = Mutation_2))+
        xlab('Group') + 
        scale_fill_brewer(palette = 'Set1', name = 'Mutation')

    First, we create a subset of your data called dat_yes, which contains only the rows with Mutation == 'Yes', and give it a new column Mutation_2 set to 'Yes'. We then add a Mutation_2 column to your original data, set to 'All' for every row. Next, we rbind dat and dat_yes to create dat_full, so each 'Yes' sample appears twice: once under 'All' and once under 'Yes'. Finally, we send dat_full to ggplot and fill by Mutation_2.
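
    If you need this for several data sets or for a different subgroup, the same steps can be wrapped in a small helper. The following is only a sketch: the function name stack_all_vs_subset and the column name Subset are made up for illustration, and the data frame is rebuilt from the table in the question. Turning Subset into a factor with 'All' as the first level is an extra step, added so that the 'All' box is drawn first in every group. The plotting part is skipped if ggplot2 is not installed.

    ```r
    # Data from the question, rebuilt by hand (10 rows per group)
    dat <- data.frame(
      Value = c(32, 5, 18, 3, 16, 14, 28, 28, 49, 15,
                43, 49, 40, 17, 9, 31, 8, 43, 50, 48,
                11, 42, 0, 15, 8, 1, 41, 15, 4, 31),
      Mutation = c("Yes", "no", "no", "no", "no", "Yes", "Yes", "Yes", "Yes", "Yes",
                   "no", "Yes", "Yes", "Yes", "no", "Yes", "Yes", "no", "Yes", "Yes",
                   "Yes", "no", "Yes", "Yes", "no", "Yes", "no", "no", "no", "Yes"),
      Group = rep(1:3, each = 10),
      stringsAsFactors = FALSE
    )

    # Hypothetical helper: stack all rows (labelled 'All') on top of the rows
    # where df[[column]] equals `level` (labelled with that level).
    stack_all_vs_subset <- function(df, column, level) {
      all_rows <- df
      all_rows$Subset <- "All"
      # which() drops rows where the comparison is NA
      sub_rows <- df[which(df[[column]] == level), , drop = FALSE]
      sub_rows$Subset <- level
      out <- rbind(all_rows, sub_rows)
      # Fix the level order so 'All' is plotted before the subgroup
      out$Subset <- factor(out$Subset, levels = c("All", level))
      out
    }

    dat_full <- stack_all_vs_subset(dat, "Mutation", "Yes")

    # Plot only if ggplot2 is available
    if (requireNamespace("ggplot2", quietly = TRUE)) {
      library(ggplot2)
      p <- ggplot(dat_full, aes(x = factor(Group), y = Value, fill = Subset)) +
        geom_boxplot() +
        xlab("Group") +
        scale_fill_brewer(palette = "Set1", name = "Mutation")
      print(p)
    }
    ```

    Because the helper takes the column and level as arguments, a call such as stack_all_vs_subset(dat, "Mutation", "no") would compare all samples against the 'no' subgroup instead.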



    dat <- structure(list(Value = c(32, 5, 18, 3, 16, 14, 28, 28, 49, 15, 
    43, 49, 40, 17, 9, 31, 8, 43, 50, 48, 11, 42, 0, 15, 8, 1, 41, 
    15, 4, 31), Mutation = c("Yes", "no", "no", "no", "no", "Yes", 
    "Yes", "Yes", "Yes", "Yes", "no", "Yes", "Yes", "Yes", "no", 
    "Yes", "Yes", "no", "Yes", "Yes", "Yes", "no", "Yes", "Yes", 
    "no", "Yes", "no", "no", "no", "Yes"), Group = c(1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L)), .Names = c("Value", 
    "Mutation", "Group"), class = "data.frame", row.names = c(NA, 