r variables statistics correlation

Correlation in R between categorical and binary variables


I want to use the attached data to see if there is a correlation between the bootcamp that students attended and the job they ended up getting. For example, does someone who attended a software engineering bootcamp end up with a software job, or does attending a data science one lead to a job in data? I have tried doing this, but I don't think it's right. I have attached a screenshot of the data. Please help with the correct code.

data

data <- data[rowSums(is.na(data)) == 0,]
summary(data)
data <- as.data.frame.matrix(data)
sapply(data,class)
data$Bootcamp <- as.numeric(factor(data$Bootcamp))
sapply(data,class)
data <- data[rowSums(is.na(data)) == 0,]

Solution

  • Here is how you can compute the correlation (remember that correlation is not causation; there can be confounders). Since I don't have access to your data, I started by generating some random data, which looks like the following (you can replace it with your actual data).

    head(data)
    #       Bootcamp software web data security engineer developer analyst
    #1  Data Science        0   1    0        0        0         1       1
    #2  Data Science        1   1    1        0        1         1       1
    #3 Cybersecurity        1   1    0        1        0         0       1
    #4 Cybersecurity        0   0    0        1        1         0       1
    #5 Cybersecurity        0   1    0        1        0         0       0
    #6  Data Science        0   1    0        1        0         0       1
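
    The answer does not show how the random data was generated. A minimal sketch of one way to build a similar table is below; the sample size, the seed, and the 50% probability for each indicator are assumptions made purely for illustration, not the values behind the output above.

    ```r
    set.seed(42)  # arbitrary seed so the toy data is reproducible (assumption)
    n <- 100      # assumed number of students
    camps <- c("Cybersecurity", "Data Science", "Software Engineering")
    jobs  <- c("software", "web", "data", "security", "engineer", "developer", "analyst")

    # One categorical column plus one random 0/1 indicator per job keyword
    data <- data.frame(Bootcamp = sample(camps, n, replace = TRUE))
    for (j in jobs) data[[j]] <- rbinom(n, size = 1, prob = 0.5)

    head(data)
    ```

    Because every indicator is drawn independently of Bootcamp, any correlation computed from this toy data should be close to zero.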
    

    Now use model.matrix(), which builds a design (model) matrix by expanding factors into sets of dummy variables, to create binary dummy variables from the categorical column.

    bootcamp <- as.data.frame(model.matrix(~ Bootcamp + 0, data)) # with no intercept term
    head(bootcamp)
    #  BootcampCybersecurity BootcampData Science BootcampSoftware Engineering
    #1                     0                    1                            0
    #2                     0                    1                            0
    #3                     1                    0                            0
    #4                     1                    0                            0
    #5                     1                    0                            0
    #6                     0                    1                            0
    

    Note that the first row has the Bootcamp value Data Science, so only the corresponding dummy variable is 1 and all the others are 0 for that row.

    Note also that only 3 dummy columns were generated for me, because the factor being expanded had only 3 levels. With your data, you will get one column per level of the factor.
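
    These two properties can be checked directly. The sketch below uses a small hypothetical data frame (`toy` is not part of the original data) and also shows what changes when the `+ 0` is left out of the formula.

    ```r
    # Hypothetical toy factor with three levels
    toy <- data.frame(Bootcamp = factor(c("Data Science", "Cybersecurity",
                                          "Software Engineering", "Data Science")))

    dummies  <- model.matrix(~ Bootcamp + 0, toy)  # no intercept: one column per level
    with_int <- model.matrix(~ Bootcamp, toy)      # with intercept: first level becomes the baseline

    ncol(dummies) == nlevels(toy$Bootcamp)  # checks for one dummy column per level
    all(rowSums(dummies) == 1)              # checks for exactly one 1 in every row
    colnames(with_int)                      # "(Intercept)" plus the non-baseline levels
    ```

    Dropping the intercept is what keeps a dummy column for every bootcamp; with the intercept, the first level (alphabetically) is absorbed into the baseline and would be missing from the correlation matrix.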

    Now, compute the correlation:

    job <- data[,2:ncol(data)]
    corr <- cor(bootcamp, job)
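
    Each entry of this matrix is a Pearson correlation between two 0/1 variables, which is the same quantity as the phi coefficient of the corresponding 2x2 table. The sketch below, on made-up data (`x` and `y` are hypothetical stand-ins for one bootcamp dummy and one job indicator), checks that equivalence and also obtains a p-value from a chi-squared test of independence.

    ```r
    set.seed(1)
    x <- rbinom(200, 1, 0.5)            # e.g. attended a given bootcamp (made-up data)
    y <- rbinom(200, 1, 0.3 + 0.3 * x)  # e.g. has a given job keyword, loosely tied to x

    r   <- cor(x, y)                                  # Pearson correlation of two 0/1 variables
    tst <- chisq.test(table(x, y), correct = FALSE)   # no continuity correction
    phi <- sqrt(unname(tst$statistic) / length(x))    # phi = sqrt(chi-squared / n)

    all.equal(abs(r), phi)  # the two agree up to floating-point error
    tst$p.value             # p-value for the test of independence
    ```

    The correlation gives the size and sign of the association, while the test gives a rough idea of whether it could be due to chance.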
    

    If you want, you can use a fancier plot for easier visualization and interpretation, like the following:

    library(ggcorrplot)
    ggcorrplot(corr, lab = TRUE)
    

    [Image: correlation heatmap produced by ggcorrplot]

    Note from the above visualization that, with my random data, the correlation between the binary variable for a data job and the dummy variable for the Data Science bootcamp is 0.1, which is weak, as you would expect from randomly generated data.

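    You can also fit a regression to check whether a particular predictor (e.g., the bootcamp attended) is a significant predictor of the response (e.g., the job type). Since the response here is binary, logistic regression is generally more appropriate than ordinary linear regression. Hope this answers your question.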
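
    As a rough sketch of the regression idea, the example below fits a logistic regression with glm(), swapped in for plain linear regression because the response is 0/1. The data, the `data_job` column name, and the built-in effect for Data Science attendees are all made up for illustration.

    ```r
    set.seed(7)
    n <- 300
    camps <- c("Cybersecurity", "Data Science", "Software Engineering")
    df <- data.frame(Bootcamp = factor(sample(camps, n, replace = TRUE)))

    # Made-up effect: Data Science attendees are more likely to have a data job
    p <- ifelse(df$Bootcamp == "Data Science", 0.6, 0.2)
    df$data_job <- rbinom(n, 1, p)

    # Logistic regression of the binary job indicator on the bootcamp factor
    fit <- glm(data_job ~ Bootcamp, data = df, family = binomial)
    summary(fit)$coefficients  # estimates, standard errors, z values, p-values
    ```

    Each coefficient is a log-odds difference relative to the baseline level (here Cybersecurity, the first level alphabetically), so a small p-value on a row suggests that bootcamp differs from the baseline in how often its attendees hold that kind of job.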