Search code examples
rlogistic-regression

logistic regression in R for a list of SNPs to get summary statistics


I want to run a regression for a case-control outcome/disease (0,1 format for controls and cases) for a list of 30 SNPs (single nucleotide polymorphisms in 0, 1, 2 format for genotypes). I know how to do it for one SNP using following in R;

test = glm(casecontrol ~ rs12345, data=mydata, family=binomial)

Question: How do I run a model to get a summary statistics for an association of 30 SNPs with the disease in one go in R? Something like we get from GWAS, beta estimates, p values, SD, allele frequencies? Any package in R that I can use?

EDIT:

structure(list(ID = 1:6, sex = c(2L, 1L, 2L, 2L, 1L, 1L), age = c(49.65, 
48.56, 49.55, 55.23, 60.62, 60.19), bmi = c(18.09, 22.82, 31.31, 
21.87, 30.07, 26.75), casecontrol = c(1L, 0L, 0L, 1L, 0L, 0L), 
    rs1 = c(1L, 0L, 1L, 0L, 0L, 2L), rs2 = c(0L, 0L, 1L, 0L, 
    0L, 0L), rs3 = c(0L, 0L, 0L, 0L, 0L, 0L), rs4 = c(1L, 0L, 
    1L, 0L, 1L, 0L), rs5 = c(0L, 1L, 1L, 2L, 2L, 1L), rs6 = c(0L, 
    0L, 1L, 0L, 0L, 0L), rs7 = c(1L, 0L, 0L, 0L, 0L, 0L), rs8 = c(0L, 
    1L, 0L, 0L, 0L, 0L), rs9 = c(1L, 0L, 0L, 1L, 0L, 1L), rs10 = c(1L, 
    1L, 0L, 1L, 0L, 0L), rs11 = c(0L, 1L, 1L, 0L, 0L, 0L), rs12 = c(0L, 
    0L, 0L, 1L, 1L, 0L), rs13 = c(1L, 0L, 1L, 1L, 1L, 0L), rs14 = c(0L, 
    1L, 0L, 1L, 0L, 0L), rs15 = c(0L, 0L, 0L, 0L, 0L, 0L), rs16 = c(0L, 
    0L, 0L, 0L, 2L, 0L), rs17 = c(2L, 1L, 1L, 1L, 0L, 0L), rs18 = c(0L, 
    0L, 0L, 1L, 0L, 1L), rs19 = c(0L, 0L, 0L, 0L, 1L, 0L), rs20 = c(0L, 
    1L, 1L, 0L, 0L, 0L), rs21 = c(1L, 1L, 1L, 2L, 0L, 0L), rs22 = c(1L, 
    1L, 1L, 0L, 0L, 0L), rs23 = c(0L, 0L, 0L, 0L, 1L, 0L), rs24 = c(0L, 
    0L, 1L, 1L, 0L, 1L), rs25 = c(1L, 0L, 1L, 1L, 0L, 0L), rs26 = c(0L, 
    1L, 0L, 0L, 0L, 0L), rs27 = c(0L, 0L, 0L, 1L, 0L, 0L), rs28 = c(1L, 
    0L, 0L, 2L, 1L, 0L), rs29 = c(0L, 0L, 0L, 1L, 0L, 0L), rs30 = c(1L, 
    1L, 0L, 0L, 0L, 0L)), row.names = c(NA, 6L), class = "data.frame")```

Solution

  • You're definitely on the right track. Using multiple predictors is as simple as adding them in one by one... So to just practice on the first 3 change the command as shown (more in a minute on making that easy with 30 predictors.

    EDITED to make sure we convert the SNPs to factors as opposed to integers assuming they are factors. Also a better toy dataset that converges

    library(dplyr)
    library(broom)
    
    glm(casecontrol ~ as.factor(rs1) + as.factor(rs2) + as.factor(rs3), data = mydata, family=binomial)
    #> 
    #> Call:  glm(formula = casecontrol ~ as.factor(rs1) + as.factor(rs2) + 
    #>     as.factor(rs3), family = binomial, data = mydata)
    #> 
    #> Coefficients:
    #>     (Intercept)  as.factor(rs1)1  as.factor(rs1)2  as.factor(rs2)1  
    #>         0.03811          0.13198         -0.20161          0.22642  
    #> as.factor(rs2)2  as.factor(rs3)1  as.factor(rs3)2  
    #>         0.10170         -0.22889         -0.03697  
    #> 
    #> Degrees of Freedom: 499 Total (i.e. Null);  493 Residual
    #> Null Deviance:       693 
    #> Residual Deviance: 688.4     AIC: 702.4
    
    
    
    

    Use the summary command to get p values etc.

    summary(glm(casecontrol ~ as.factor(rs1) + as.factor(rs2) + as.factor(rs3), data = mydata, family=binomial))
    #> 
    #> Call:
    #> glm(formula = casecontrol ~ as.factor(rs1) + as.factor(rs2) + 
    #>     as.factor(rs3), family = binomial, data = mydata)
    #> 
    #> Deviance Residuals: 
    #>    Min      1Q  Median      3Q     Max  
    #> -1.350  -1.193   1.014   1.161   1.348  
    #> 
    #> Coefficients:
    #>                 Estimate Std. Error z value Pr(>|z|)
    #> (Intercept)      0.03811    0.22619   0.168    0.866
    #> as.factor(rs1)1  0.13198    0.22276   0.592    0.554
    #> as.factor(rs1)2 -0.20161    0.22264  -0.906    0.365
    #> as.factor(rs2)1  0.22642    0.22111   1.024    0.306
    #> as.factor(rs2)2  0.10170    0.21969   0.463    0.643
    #> as.factor(rs3)1 -0.22889    0.21864  -1.047    0.295
    #> as.factor(rs3)2 -0.03697    0.22117  -0.167    0.867
    #> 
    #> (Dispersion parameter for binomial family taken to be 1)
    #> 
    #>     Null deviance: 693.02  on 499  degrees of freedom
    #> Residual deviance: 688.39  on 493  degrees of freedom
    #> AIC: 702.39
    #> 
    #> Number of Fisher Scoring iterations: 3
    

    Better yet use broom::tidy to get nice output

    tidy(glm(casecontrol ~ as.factor(rs1) + as.factor(rs2) + as.factor(rs3), data = mydata, family=binomial))
    #> # A tibble: 7 x 5
    #>   term            estimate std.error statistic p.value
    #>   <chr>              <dbl>     <dbl>     <dbl>   <dbl>
    #> 1 (Intercept)       0.0381     0.226     0.168   0.866
    #> 2 as.factor(rs1)1   0.132      0.223     0.592   0.554
    #> 3 as.factor(rs1)2  -0.202      0.223    -0.906   0.365
    #> 4 as.factor(rs2)1   0.226      0.221     1.02    0.306
    #> 5 as.factor(rs2)2   0.102      0.220     0.463   0.643
    #> 6 as.factor(rs3)1  -0.229      0.219    -1.05    0.295
    #> 7 as.factor(rs3)2  -0.0370     0.221    -0.167   0.867
    

    Obviously with example data we won't get real answers.

    To be most efficient of your time create a temporary data set for analysis. We'll call it justanalyze that contains just the outcome and variables you actually want to use. Then we can use casecontrol ~ . to say casecontrol with everything else as a predictor.

    justanalyze <- 
      mydata %>% 
      select(casecontrol, rs1:rs30) %>% 
      mutate_at(vars(rs1:rs30), as.factor)
    
    # glm(casecontrol ~ ., data = justanalyze, family=binomial)
    # summary(glm(casecontrol ~ ., data = justanalyze, family=binomial))
    tidy(glm(casecontrol ~ ., data = justanalyze, family=binomial))
    #> # A tibble: 61 x 5
    #>    term        estimate std.error statistic p.value
    #>    <chr>          <dbl>     <dbl>     <dbl>   <dbl>
    #>  1 (Intercept) -0.493       0.782  -0.630    0.529 
    #>  2 rs11         0.143       0.249   0.574    0.566 
    #>  3 rs12        -0.157       0.244  -0.642    0.521 
    #>  4 rs21         0.106       0.248   0.428    0.669 
    #>  5 rs22         0.0427      0.243   0.176    0.860 
    #>  6 rs31        -0.231       0.238  -0.970    0.332 
    #>  7 rs32         0.00169     0.245   0.00690  0.994 
    #>  8 rs41        -0.259       0.244  -1.06     0.288 
    #>  9 rs42        -0.474       0.253  -1.87     0.0610
    #> 10 rs51         0.0148      0.256   0.0577   0.954 
    #> # … with 51 more rows
    

    Better made up data (example)

    set.seed(2020)
    mydata <- data.frame(ID = 1:100, 
                         sex = sample(1:2, size = 500, replace = TRUE), 
                         age = runif(100, min= 35, max = 70), 
                         bmi = runif(100, min= 15, max = 35), 
                         casecontrol = sample(0:1, size = 500, replace = TRUE), 
                         rs1 = sample(0:2, size = 500, replace = TRUE), 
                         rs2 = sample(0:2, size = 500, replace = TRUE),
                         rs3 = sample(0:2, size = 500, replace = TRUE),
                         rs4 = sample(0:2, size = 500, replace = TRUE),
                         rs5 = sample(0:2, size = 500, replace = TRUE),
                         rs6 = sample(0:2, size = 500, replace = TRUE),
                         rs7 = sample(0:2, size = 500, replace = TRUE),
                         rs8 = sample(0:2, size = 500, replace = TRUE),
                         rs9 = sample(0:2, size = 500, replace = TRUE),
                         rs10 = sample(0:2, size = 500, replace = TRUE),
                         rs11 = sample(0:2, size = 500, replace = TRUE),
                         rs12 = sample(0:2, size = 500, replace = TRUE),
                         rs13 = sample(0:2, size = 500, replace = TRUE),
                         rs14 = sample(0:2, size = 500, replace = TRUE),
                         rs15 = sample(0:2, size = 500, replace = TRUE),
                         rs16 = sample(0:2, size = 500, replace = TRUE),
                         rs17 = sample(0:2, size = 500, replace = TRUE),
                         rs18 = sample(0:2, size = 500, replace = TRUE),
                         rs19 = sample(0:2, size = 500, replace = TRUE),
                         rs20 = sample(0:2, size = 500, replace = TRUE),
                         rs21 = sample(0:2, size = 500, replace = TRUE),
                         rs22 = sample(0:2, size = 500, replace = TRUE),
                         rs23 = sample(0:2, size = 500, replace = TRUE),
                         rs24 = sample(0:2, size = 500, replace = TRUE),
                         rs25 = sample(0:2, size = 500, replace = TRUE),
                         rs26 = sample(0:2, size = 500, replace = TRUE),
                         rs27 = sample(0:2, size = 500, replace = TRUE),
                         rs28 = sample(0:2, size = 500, replace = TRUE),
                         rs29 = sample(0:2, size = 500, replace = TRUE),
                         rs30 = sample(0:2, size = 500, replace = TRUE)
    )
    
    # mydata