
How to find the p-value for two sets of data in R?


New to R, and I have two data sets -- they have the same x-axis values, but the y-axis varies.

I'm trying to find the correlation between the two. When I use R to draw ablines through the scatter plot, I get two lines of best fit, one of which seemingly sits higher than the other -- but I'd really like a p-value comparing the two data sets so I know whether the difference is a real effect.

After looking it up, it seems like I should use t.test -- but I'm unsure how to test the two data sets against each other.

For example, if I run:

t.test(t1$xaxis,t1$yaxis1)
t.test(t2$xaxis,t2$yaxis2)

It gives me the right means of x and y (t1: 16.84, 88.58 and t2: 14.79, 86.14) -- but for the rest, I'm not sure:

t1: t = -43.8061, df = 105.994, p-value < 2.2e-16

t2: t = -60.1593, df = 232.742, p-value < 2.2e-16

Obviously the p-values given are microscopic, and I don't know how to make t.test tell me about the data sets' relationship with each other rather than about each one individually.

Any help is greatly appreciated -- thanks!


Solution

  • Since you asked for it, here is how I understand your problem.

    You have two groups of y values corresponding to identical x values. Here I assume that the relationship between y and x is linear. If it isn't you could transform your variables, use a non-linear model, an additive model, ...

    First let's simulate some data since you don't provide any:

    set.seed(42)
    x <- 1:20
    y1 <- 2.5 + 3 * x + rnorm(20)
    y2 <- 4 + 2.5 * x + rnorm(20)
    
    plot(y1~x, col="blue", ylab="y")
    points(y2~x, col="red")
    legend("topleft", legend=c("y1", "y2"), col=c("blue", "red"), pch=1)
    

    [Scatter plot of y1 (blue) and y2 (red) against x]

    Now, we want to know if the two samples differ. We can find out by fitting a model:

    DF <- cbind(stack(cbind.data.frame(y1, y2)), x)
    names(DF) <- c("y", "group", "x")
    
    fit <- lm(y~x*group, data=DF)
    summary(fit)
    
    Call:
    lm(formula = y ~ x * group, data = DF)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -2.2585 -0.4603 -0.1899  0.9008  2.2127 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
    (Intercept)  3.51769    0.55148   6.379 2.17e-07 ***
    x            2.92136    0.04604  63.457  < 2e-16 ***
    groupy2      0.67218    0.77991   0.862    0.394    
    x:groupy2   -0.46525    0.06511  -7.146 2.11e-08 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 1.187 on 36 degrees of freedom
    Multiple R-squared:  0.9949,    Adjusted R-squared:  0.9945 
    F-statistic:  2333 on 3 and 36 DF,  p-value: < 2.2e-16
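
    If you want the p-value for the slope difference as a number rather than reading it off the printed table, you can index into the coefficient matrix. This is just a sketch using the same simulated data as above; the row name `x:groupy2` is the interaction term that `lm` creates for these particular variable and group names:

    ```r
    # Reproduce the simulated data and model from above
    set.seed(42)
    x <- 1:20
    y1 <- 2.5 + 3 * x + rnorm(20)
    y2 <- 4 + 2.5 * x + rnorm(20)
    DF <- cbind(stack(cbind.data.frame(y1, y2)), x)
    names(DF) <- c("y", "group", "x")
    fit <- lm(y ~ x * group, data = DF)

    # coef(summary(fit)) is the coefficient matrix printed above;
    # pull the Pr(>|t|) entry for the interaction (difference in slopes)
    p_slope <- coef(summary(fit))["x:groupy2", "Pr(>|t|)"]
    p_slope < 0.05   # TRUE: the slopes differ significantly
    ```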
    

    The intercepts are not significantly different, but the slopes are. Whether group has a significant effect overall is best tested by comparing against a model that doesn't consider group at all:

    fit0 <- lm(y~x, data=DF)
    anova(fit0, fit)
    
    Analysis of Variance Table
    
    Model 1: y ~ x
    Model 2: y ~ x * group
      Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
    1     38 300.196                                  
    2     36  50.738  2    249.46 88.498 1.267e-14 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    

    As you see, the samples are different.
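
    A p-value only tells you the difference is unlikely to be chance; to see how large the slope difference is, a confidence interval is often more informative. A sketch on the same simulated data (again using the model-generated row name `x:groupy2`):

    ```r
    # Same simulated data and model as above
    set.seed(42)
    x <- 1:20
    y1 <- 2.5 + 3 * x + rnorm(20)
    y2 <- 4 + 2.5 * x + rnorm(20)
    DF <- cbind(stack(cbind.data.frame(y1, y2)), x)
    names(DF) <- c("y", "group", "x")
    fit <- lm(y ~ x * group, data = DF)

    # 95% confidence interval for the difference in slopes;
    # it excludes 0, consistent with the significant interaction term
    confint(fit)["x:groupy2", ]
    ```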