
Regression in R with categorical variables


I'm trying to understand regression in R. I'm working on an exercise that uses a dataset of 100 random males and females, like this:

sex     sbp      bmi
male     130     40.0
female   126     29.0
female   115     25.0
male     120     33.0
female   128     34.0
...

I want to get a numerical summary (0), plot the relation between sbp and bmi (1), and estimate the beta1, beta2 and sigma parameters along with R^2 (2). Then I want to check the goodness of fit of the model (3) and get the confidence intervals (4).

I think that sex is a categorical variable, so here is my code:

as.numeric(framingham$sex) - 1
apply(framingham, 2, class)

#0
framingham$sex <- factor (framingham$sex)
levels (framingham$sex) <- c("female", "male")
resultadoNumerico <- compareGroups(~., data = framingham)
resumenNumerico <- createTable(resultadoNumerico)
resumenNumerico

# 1
framinghamMatrix <- data.matrix(framingham)
pairs(framinghamMatrix)
cor(framinghamMatrix)

#2
regre <- lm(sbp ~ bmi+sex, data = framingham)
regreSum <- summary(regre)
regreSum
# Sigma
regreSum$sigma
# Betas
regreSum$coefficients

#3
plot(framingham$bmi, framingham$sbp, xlab = "SBP", ylab = "BMI")
abline (regre)

But I don't think I'm doing things right. Could you help me? Thanks in advance.


Solution

  • To check the relation between variables, try the pairs.panels plot from the psych library. It shows the distributions, scatter plots and correlation coefficients.

    library(psych)
    pairs.panels(framingham)
    

    The sex variable here is categorical, so convert it to a factor before passing it to your linear regression model. By default the factor levels are sorted alphabetically and the first level becomes the reference level, so the model summary only shows coefficients for the levels other than the reference. In this case female is the reference level, so the summary reports a coefficient for male.

    framingham$sex<-as.factor(framingham$sex)
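
    As a small sketch of how the reference level is chosen, the snippet below rebuilds only the five rows shown in the question (not the full 100-row dataset) and inspects the factor. The name sex_male_ref is hypothetical and only used for illustration.

    ```r
    # Only the five rows printed in the question; the full dataset is not shown there
    framingham <- data.frame(
      sex = c("male", "female", "female", "male", "female"),
      sbp = c(130, 126, 115, 120, 128),
      bmi = c(40, 29, 25, 33, 34)
    )

    framingham$sex <- as.factor(framingham$sex)

    # Levels are sorted alphabetically; the first one is the reference level
    levels(framingham$sex)

    # Dummy coding that lm() will use for this factor
    contrasts(framingham$sex)

    # To make "male" the reference instead, relevel into a new factor
    sex_male_ref <- relevel(framingham$sex, ref = "male")
    levels(sex_male_ref)
    ```

    With female as the reference, the coefficient that lm() labels sexmale is the estimated difference in sbp between males and females at the same bmi.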
    

    Now create your linear model.

    model <- lm(sbp ~ bmi+sex, data = framingham)
    model
    summary(model)
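
    For the plot, note that abline() called on a fitted model with three coefficients only uses the first two (and warns about it), so it does not draw a separate line for each sex. A rough sketch of one alternative is below: because the model has no interaction term, it implies two parallel lines that share the bmi slope. The data here are simulated stand-ins (the names dat and fit and all the numbers are assumptions, not the asker's data).

    ```r
    # Simulated stand-in data; the coefficients used to generate it are arbitrary
    set.seed(42)
    n <- 100
    dat <- data.frame(
      sex = factor(rep(c("female", "male"), each = n / 2)),
      bmi = rnorm(n, mean = 28, sd = 4)
    )
    dat$sbp <- 95 + 1.0 * dat$bmi + 6 * (dat$sex == "male") + rnorm(n, sd = 8)

    fit <- lm(sbp ~ bmi + sex, data = dat)
    b <- coef(fit)

    # bmi on the x axis, sbp on the y axis, points coloured by sex
    plot(dat$bmi, dat$sbp,
         col = ifelse(dat$sex == "male", "blue", "red"),
         xlab = "BMI", ylab = "SBP")

    # Reference level (female): intercept and bmi slope
    abline(a = b["(Intercept)"], b = b["bmi"], col = "red")

    # Male: same slope, intercept shifted by the sexmale coefficient
    abline(a = b["(Intercept)"] + b["sexmale"], b = b["bmi"], col = "blue")

    legend("topleft", legend = levels(dat$sex), col = c("red", "blue"), lty = 1)
    ```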
    

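    The summary gives the coefficient estimates (including the intercept), their standard errors, t-values and p-values (which indicate the significance of each variable), the residual standard error (sigma), Multiple R-squared (goodness of fit), Adjusted R-squared (goodness of fit adjusted for model complexity), etc. It does not print confidence intervals; confint(model) returns them, at the 95% level by default.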
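
    A minimal sketch of pulling those quantities out of the fitted model, including the confidence intervals asked for in step (4), could look like the following. It runs on simulated stand-in data (sim, and the numbers used to generate it, are assumptions), so the values will differ from those for the real framingham data.

    ```r
    # Simulated stand-in data; not the asker's dataset
    set.seed(1)
    n <- 100
    sim <- data.frame(
      sex = factor(sample(c("female", "male"), n, replace = TRUE)),
      bmi = round(rnorm(n, mean = 28, sd = 4), 1)
    )
    sim$sbp <- 90 + 1.2 * sim$bmi + 5 * (sim$sex == "male") + rnorm(n, sd = 8)

    model <- lm(sbp ~ bmi + sex, data = sim)
    s <- summary(model)

    coef(model)        # estimates: (Intercept), bmi, sexmale
    s$coefficients     # estimates with standard errors, t-values and p-values
    s$sigma            # residual standard error (estimate of sigma)
    s$r.squared        # Multiple R-squared
    s$adj.r.squared    # Adjusted R-squared

    # 95% confidence intervals for the coefficients
    ci <- confint(model, level = 0.95)
    ci
    ```

    Calling plot(model) afterwards shows the standard residual diagnostic plots, which is one common way to check the model beyond R-squared.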