I want to run a regression for a case-control outcome/disease (0,1 format for controls and cases) for a list of 30 SNPs (single nucleotide polymorphisms in 0, 1, 2 format for genotypes). I know how to do it for one SNP using following in R;
test = glm(casecontrol ~ rs12345, data=mydata, family=binomial)
Question: How do I run a model to get a summary statistics for an association of 30 SNPs with the disease in one go in R? Something like we get from GWAS, beta estimates, p values, SD, allele frequencies? Any package in R that I can use?
EDIT:
structure(list(ID = 1:6, sex = c(2L, 1L, 2L, 2L, 1L, 1L), age = c(49.65,
48.56, 49.55, 55.23, 60.62, 60.19), bmi = c(18.09, 22.82, 31.31,
21.87, 30.07, 26.75), casecontrol = c(1L, 0L, 0L, 1L, 0L, 0L),
rs1 = c(1L, 0L, 1L, 0L, 0L, 2L), rs2 = c(0L, 0L, 1L, 0L,
0L, 0L), rs3 = c(0L, 0L, 0L, 0L, 0L, 0L), rs4 = c(1L, 0L,
1L, 0L, 1L, 0L), rs5 = c(0L, 1L, 1L, 2L, 2L, 1L), rs6 = c(0L,
0L, 1L, 0L, 0L, 0L), rs7 = c(1L, 0L, 0L, 0L, 0L, 0L), rs8 = c(0L,
1L, 0L, 0L, 0L, 0L), rs9 = c(1L, 0L, 0L, 1L, 0L, 1L), rs10 = c(1L,
1L, 0L, 1L, 0L, 0L), rs11 = c(0L, 1L, 1L, 0L, 0L, 0L), rs12 = c(0L,
0L, 0L, 1L, 1L, 0L), rs13 = c(1L, 0L, 1L, 1L, 1L, 0L), rs14 = c(0L,
1L, 0L, 1L, 0L, 0L), rs15 = c(0L, 0L, 0L, 0L, 0L, 0L), rs16 = c(0L,
0L, 0L, 0L, 2L, 0L), rs17 = c(2L, 1L, 1L, 1L, 0L, 0L), rs18 = c(0L,
0L, 0L, 1L, 0L, 1L), rs19 = c(0L, 0L, 0L, 0L, 1L, 0L), rs20 = c(0L,
1L, 1L, 0L, 0L, 0L), rs21 = c(1L, 1L, 1L, 2L, 0L, 0L), rs22 = c(1L,
1L, 1L, 0L, 0L, 0L), rs23 = c(0L, 0L, 0L, 0L, 1L, 0L), rs24 = c(0L,
0L, 1L, 1L, 0L, 1L), rs25 = c(1L, 0L, 1L, 1L, 0L, 0L), rs26 = c(0L,
1L, 0L, 0L, 0L, 0L), rs27 = c(0L, 0L, 0L, 1L, 0L, 0L), rs28 = c(1L,
0L, 0L, 2L, 1L, 0L), rs29 = c(0L, 0L, 0L, 1L, 0L, 0L), rs30 = c(1L,
1L, 0L, 0L, 0L, 0L)), row.names = c(NA, 6L), class = "data.frame")```
You're definitely on the right track. Using multiple predictors is as simple as adding them in one by one... So to just practice on the first 3 change the command as shown (more in a minute on making that easy with 30 predictors.
EDITED to make sure we convert the SNPs to factors as opposed to integers assuming they are factors. Also a better toy dataset that converges
library(dplyr)
library(broom)
glm(casecontrol ~ as.factor(rs1) + as.factor(rs2) + as.factor(rs3), data = mydata, family=binomial)
#>
#> Call: glm(formula = casecontrol ~ as.factor(rs1) + as.factor(rs2) +
#> as.factor(rs3), family = binomial, data = mydata)
#>
#> Coefficients:
#> (Intercept) as.factor(rs1)1 as.factor(rs1)2 as.factor(rs2)1
#> 0.03811 0.13198 -0.20161 0.22642
#> as.factor(rs2)2 as.factor(rs3)1 as.factor(rs3)2
#> 0.10170 -0.22889 -0.03697
#>
#> Degrees of Freedom: 499 Total (i.e. Null); 493 Residual
#> Null Deviance: 693
#> Residual Deviance: 688.4 AIC: 702.4
Use the summary
command to get p values etc.
summary(glm(casecontrol ~ as.factor(rs1) + as.factor(rs2) + as.factor(rs3), data = mydata, family=binomial))
#>
#> Call:
#> glm(formula = casecontrol ~ as.factor(rs1) + as.factor(rs2) +
#> as.factor(rs3), family = binomial, data = mydata)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.350 -1.193 1.014 1.161 1.348
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.03811 0.22619 0.168 0.866
#> as.factor(rs1)1 0.13198 0.22276 0.592 0.554
#> as.factor(rs1)2 -0.20161 0.22264 -0.906 0.365
#> as.factor(rs2)1 0.22642 0.22111 1.024 0.306
#> as.factor(rs2)2 0.10170 0.21969 0.463 0.643
#> as.factor(rs3)1 -0.22889 0.21864 -1.047 0.295
#> as.factor(rs3)2 -0.03697 0.22117 -0.167 0.867
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 693.02 on 499 degrees of freedom
#> Residual deviance: 688.39 on 493 degrees of freedom
#> AIC: 702.39
#>
#> Number of Fisher Scoring iterations: 3
Better yet use broom::tidy
to get nice output
tidy(glm(casecontrol ~ as.factor(rs1) + as.factor(rs2) + as.factor(rs3), data = mydata, family=binomial))
#> # A tibble: 7 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 0.0381 0.226 0.168 0.866
#> 2 as.factor(rs1)1 0.132 0.223 0.592 0.554
#> 3 as.factor(rs1)2 -0.202 0.223 -0.906 0.365
#> 4 as.factor(rs2)1 0.226 0.221 1.02 0.306
#> 5 as.factor(rs2)2 0.102 0.220 0.463 0.643
#> 6 as.factor(rs3)1 -0.229 0.219 -1.05 0.295
#> 7 as.factor(rs3)2 -0.0370 0.221 -0.167 0.867
Obviously with example data we won't get real answers.
To be most efficient of your time create a temporary data set for analysis. We'll call it justanalyze
that contains just the outcome and variables you actually want to use. Then we can use casecontrol ~ .
to say casecontrol with everything else as a predictor.
justanalyze <-
mydata %>%
select(casecontrol, rs1:rs30) %>%
mutate_at(vars(rs1:rs30), as.factor)
# glm(casecontrol ~ ., data = justanalyze, family=binomial)
# summary(glm(casecontrol ~ ., data = justanalyze, family=binomial))
tidy(glm(casecontrol ~ ., data = justanalyze, family=binomial))
#> # A tibble: 61 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -0.493 0.782 -0.630 0.529
#> 2 rs11 0.143 0.249 0.574 0.566
#> 3 rs12 -0.157 0.244 -0.642 0.521
#> 4 rs21 0.106 0.248 0.428 0.669
#> 5 rs22 0.0427 0.243 0.176 0.860
#> 6 rs31 -0.231 0.238 -0.970 0.332
#> 7 rs32 0.00169 0.245 0.00690 0.994
#> 8 rs41 -0.259 0.244 -1.06 0.288
#> 9 rs42 -0.474 0.253 -1.87 0.0610
#> 10 rs51 0.0148 0.256 0.0577 0.954
#> # … with 51 more rows
Better made up data (example)
set.seed(2020)
mydata <- data.frame(ID = 1:100,
sex = sample(1:2, size = 500, replace = TRUE),
age = runif(100, min= 35, max = 70),
bmi = runif(100, min= 15, max = 35),
casecontrol = sample(0:1, size = 500, replace = TRUE),
rs1 = sample(0:2, size = 500, replace = TRUE),
rs2 = sample(0:2, size = 500, replace = TRUE),
rs3 = sample(0:2, size = 500, replace = TRUE),
rs4 = sample(0:2, size = 500, replace = TRUE),
rs5 = sample(0:2, size = 500, replace = TRUE),
rs6 = sample(0:2, size = 500, replace = TRUE),
rs7 = sample(0:2, size = 500, replace = TRUE),
rs8 = sample(0:2, size = 500, replace = TRUE),
rs9 = sample(0:2, size = 500, replace = TRUE),
rs10 = sample(0:2, size = 500, replace = TRUE),
rs11 = sample(0:2, size = 500, replace = TRUE),
rs12 = sample(0:2, size = 500, replace = TRUE),
rs13 = sample(0:2, size = 500, replace = TRUE),
rs14 = sample(0:2, size = 500, replace = TRUE),
rs15 = sample(0:2, size = 500, replace = TRUE),
rs16 = sample(0:2, size = 500, replace = TRUE),
rs17 = sample(0:2, size = 500, replace = TRUE),
rs18 = sample(0:2, size = 500, replace = TRUE),
rs19 = sample(0:2, size = 500, replace = TRUE),
rs20 = sample(0:2, size = 500, replace = TRUE),
rs21 = sample(0:2, size = 500, replace = TRUE),
rs22 = sample(0:2, size = 500, replace = TRUE),
rs23 = sample(0:2, size = 500, replace = TRUE),
rs24 = sample(0:2, size = 500, replace = TRUE),
rs25 = sample(0:2, size = 500, replace = TRUE),
rs26 = sample(0:2, size = 500, replace = TRUE),
rs27 = sample(0:2, size = 500, replace = TRUE),
rs28 = sample(0:2, size = 500, replace = TRUE),
rs29 = sample(0:2, size = 500, replace = TRUE),
rs30 = sample(0:2, size = 500, replace = TRUE)
)
# mydata