Tags: r, ggplot2, mean, line-plot

ggplot: lineplot of means of two groups


I have searched Stack Overflow at length for an answer to my question. One existing question comes close, but I have not managed to adapt its code to fix my graph.

I have data, reshaped in long format, that looks like this:

ID   Var1      GenePosition   ContinuousOutcomeVar
1    control   X20068492      0.092813611
2    control   X20068492      0.001746708
3    case      X20068492      0.069251157
4    case      X20068492      0.003639304

Each ID has one value of ContinuousOutcomeVar per position; there are 86 positions and 10 IDs. I want a line graph with position on the x axis and the continuous outcome variable on the y axis, split into two groups, case and control. There should be two dots at every position: the mean value for cases and the mean value for controls. One line should connect the case means and another should connect the control means.

I know this is easy, but I'm new to R; I've been working at it for 8 hours and can't quite get it right. Below is what I have, and I'd really appreciate some insight. If this has already been answered somewhere, I apologize: I honestly looked all over and tried modifying a lot of code, but still haven't gotten it right.

My code plots all the values for all IDs at each position and connects them within each of the two groups. It also gives me a black dot at the mean of all 10 values per position (I think):

lineplot <- ggplot(data=seq.long, aes(x=Position, y=PMethyl, 
    group=CACO, colour=CACO)) +
    stat_summary (fun.y=mean, geom="point", aes(group=1), color="black") +      
    geom_line() + geom_point()

I can't get R to stop plotting all 10 points. I want just two means per position (one for cases, one for controls), with the case means connected by one line and the control means by another across the x axis.


Solution

  • First, I adjusted your original sample data so that it contains more than one unique GenePosition.

    dput(seq.long)
    structure(list(ID = 1:8, Var1 = structure(c(2L, 2L, 1L, 1L, 2L, 
    2L, 1L, 1L), .Label = c("case", "control"), class = "factor"), 
        GenePosition = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L
        ), .Label = c("X20068492", "X20068493"), class = "factor"), 
        ContinuousOutcomeVar = c(0.092813611, 0.001746708, 0.069251157, 
        0.003639304, 0.112813611, 0.002746708, 0.089251157, 0.004639304
        )), .Names = c("ID", "Var1", "GenePosition", "ContinuousOutcomeVar"
    ), class = "data.frame", row.names = c(NA, -8L))
    

    If you want just one value for each combination of GenePosition and Var1, it is easier to calculate the means before plotting. This can be done with the function ddply() from the plyr package.

    library(plyr)    
    seq.long.sum<-ddply(seq.long,.(Var1,GenePosition),
           summarize, value = mean(ContinuousOutcomeVar))
    seq.long.sum
         Var1 GenePosition      value
    1    case    X20068492 0.03644523
    2    case    X20068493 0.04694523
    3 control    X20068492 0.04728016
    4 control    X20068493 0.05778016
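
    If you would rather not load plyr, the same per-group means can probably be obtained with aggregate() from base R. The following is a minimal sketch, assuming the sample data from the dput() output above (rebuilt here with data.frame() so the block runs on its own). The rows may come back in a different order than with ddply(), and the result column is renamed to value only to match the plotting code.

    ```r
    # Sample data copied from the dput() output above
    seq.long <- data.frame(
      ID = 1:8,
      Var1 = factor(c("control", "control", "case", "case",
                      "control", "control", "case", "case")),
      GenePosition = factor(rep(c("X20068492", "X20068493"), each = 4)),
      ContinuousOutcomeVar = c(0.092813611, 0.001746708, 0.069251157,
                               0.003639304, 0.112813611, 0.002746708,
                               0.089251157, 0.004639304)
    )

    # One mean per Var1 x GenePosition combination, using base R only
    seq.long.sum <- aggregate(ContinuousOutcomeVar ~ Var1 + GenePosition,
                              data = seq.long, FUN = mean)

    # Rename the summary column so it matches the name used for plotting
    names(seq.long.sum)[3] <- "value"
    seq.long.sum
    ```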
    

    Now, with this new data frame, you only have to supply the x and y values. Var1 should be mapped to both colour= and group= so that each group gets its own colour and the points within each group are connected by a line.

    ggplot(seq.long.sum,aes(x=GenePosition,y=value,colour=Var1,group=Var1))+
       geom_point()+geom_line()
    

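
    An alternative that stays closer to the code in the question is to skip the pre-computed summary and let stat_summary() calculate the group means while plotting. The sketch below assumes the same sample data as above (rebuilt with data.frame() so it runs on its own) and ggplot2 3.3.0 or later, where the argument is fun; older versions use fun.y, as in the question. Because there is no geom_point() or geom_line() layer on the raw data, only the means are drawn.

    ```r
    library(ggplot2)

    # Sample data copied from the dput() output above
    seq.long <- data.frame(
      ID = 1:8,
      Var1 = factor(c("control", "control", "case", "case",
                      "control", "control", "case", "case")),
      GenePosition = factor(rep(c("X20068492", "X20068493"), each = 4)),
      ContinuousOutcomeVar = c(0.092813611, 0.001746708, 0.069251157,
                               0.003639304, 0.112813611, 0.002746708,
                               0.089251157, 0.004639304)
    )

    # stat_summary() reduces each Var1 group to its mean at every x value.
    # group = Var1 is needed so the line connects means within a group
    # across the discrete x axis.
    p <- ggplot(seq.long, aes(x = GenePosition, y = ContinuousOutcomeVar,
                              colour = Var1, group = Var1)) +
      stat_summary(fun = mean, geom = "point") +
      stat_summary(fun = mean, geom = "line")
    p
    ```

    The key difference from the code in the question is that the grouping stays on the case/control variable inside stat_summary() (no aes(group=1)), and the layers that draw the raw values are dropped.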