I want to use the attached data to see if there is a correlation between the bootcamp that students attended and the job they end up getting. For example, does someone who attended a software engineering bootcamp end up with a software job, or does attending a data science one lead to a job in data? I have tried doing this, but I don't think it's right. I have attached a screenshot of the data. Please help with the correct code.
data <- data[rowSums(is.na(data)) == 0,]
summary(data)
data <- as.data.frame.matrix(data)
sapply(data,class)
data$Bootcamp <- as.numeric(factor(data$Bootcamp))
sapply(data,class)
data <- data[rowSums(is.na(data)) == 0,]
Here is how you can compute the correlation (remember that correlation is not causation; there can be confounders). Since I don't have access to your data, I started by generating some random data, which looks like the following (you can replace it with your actual data).
head(data)
#       Bootcamp software web data security engineer developer analyst
#1  Data Science        0   1    0        0        0         1       1
#2  Data Science        1   1    1        0        1         1       1
#3 Cybersecurity        1   1    0        1        0         0       1
#4 Cybersecurity        0   0    0        1        1         0       1
#5 Cybersecurity        0   1    0        1        0         0       0
#6  Data Science        0   1    0        1        0         0       1
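The code that produced the random data is not shown above. A minimal sketch of one way to generate data of the same shape follows; the seed, the sample size `n`, and the 0.5 probability are assumptions made for illustration, and only the column names and bootcamp levels are taken from the output above.

```r
# Hypothetical sample data: column names and bootcamp levels mirror the
# output shown above; the values themselves are random.
set.seed(1)  # arbitrary seed, only for reproducibility
n <- 100     # assumed number of students
job_cols <- c("software", "web", "data", "security",
              "engineer", "developer", "analyst")

data <- data.frame(
  Bootcamp = sample(c("Cybersecurity", "Data Science", "Software Engineering"),
                    n, replace = TRUE),
  # one 0/1 indicator column per job keyword
  matrix(rbinom(n * length(job_cols), size = 1, prob = 0.5),
         nrow = n, dimnames = list(NULL, job_cols))
)
head(data)
```

Because the job columns are drawn independently of `Bootcamp` here, any correlation found in such data is noise; with real data the same steps apply unchanged.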
Now use the function model.matrix(), which creates a design (or model) matrix, e.g., by expanding factors to a set of dummy variables, to create binary dummy variables from the categorical column.
bootcamp <- as.data.frame(model.matrix(~ Bootcamp + 0, data)) # with no intercept term
head(bootcamp)
#  BootcampCybersecurity BootcampData Science BootcampSoftware Engineering
#1                     0                    1                            0
#2                     0                    1                            0
#3                     1                    0                            0
#4                     1                    0                            0
#5                     1                    0                            0
#6                     0                    1                            0
Note that the first row has the Bootcamp value Data Science, hence only the corresponding dummy variable has value 1; all others have value 0 for that row.
Note that it generated only 3 dummy columns for me, since the factor variable being expanded has only 3 levels. In general, you will get one column per level of the factor variable.
Now, compute the correlation:
job <- data[,2:ncol(data)]
corr <- cor(bootcamp, job)
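Since both the dummy columns and the job columns contain only 0s and 1s, each entry of corr is a Pearson correlation between two binary variables, which coincides with the phi coefficient of the corresponding 2x2 table. A small sketch of that equivalence, using made-up vectors (the names `ds_bootcamp` and `data_job` are hypothetical and not taken from the data above):

```r
# Made-up 0/1 vectors for illustration only
ds_bootcamp <- c(1, 1, 0, 0, 0, 1, 0, 1, 0, 0)  # attended the Data Science bootcamp?
data_job    <- c(1, 0, 0, 0, 1, 1, 0, 1, 0, 0)  # ended up in a data job?

# Pearson correlation, as cor(bootcamp, job) computes for each column pair
r <- cor(ds_bootcamp, data_job)

# Phi coefficient from the 2x2 contingency table
tab <- table(ds_bootcamp, data_job)
n11 <- tab["1", "1"]; n10 <- tab["1", "0"]
n01 <- tab["0", "1"]; n00 <- tab["0", "0"]
phi <- (n11 * n00 - n10 * n01) /
  sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))

all.equal(r, as.numeric(phi))  # TRUE: the two definitions agree for 0/1 data
```

A value near 0 means attending that bootcamp tells you little about holding that job; values toward 1 or -1 indicate a stronger positive or negative association.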
You can use a fancier plot for better visualization and interpretation if you want, like the following:
library(ggcorrplot)
ggcorrplot(corr, lab = TRUE)
Note from the above visualization that, with my data, the correlation between the binary variable representing a data job and the binary variable representing the data science bootcamp is 0.1, i.e., only a weak positive association.
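You can also fit a linear regression to test whether a particular predictor (e.g., the bootcamp attended) is significant for predicting the response (e.g., the job type); since the job indicators are binary, logistic regression is usually the more appropriate model. Hope it answers your question.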
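As a rough sketch of the regression idea, the following fits both a linear model and a logistic model of a binary job indicator on the bootcamp factor. The data frame `sim`, the seed, and the probabilities 0.7 and 0.3 are assumptions made up for this illustration; with real data you would pass your own data frame instead.

```r
# Hypothetical data: a Bootcamp factor and a binary 'data' job indicator,
# simulated so that Data Science attendees are more likely to hold a data job.
set.seed(42)  # arbitrary seed
n <- 300
Bootcamp <- factor(sample(c("Cybersecurity", "Data Science", "Software Engineering"),
                          n, replace = TRUE))
p <- ifelse(Bootcamp == "Data Science", 0.7, 0.3)  # assumed probabilities
sim <- data.frame(Bootcamp = Bootcamp, data = rbinom(n, size = 1, prob = p))

# Linear regression on a 0/1 response (a linear probability model)
fit_lm <- lm(data ~ Bootcamp, data = sim)
summary(fit_lm)$coefficients

# Logistic regression, usually a better fit for a 0/1 response
fit_glm <- glm(data ~ Bootcamp, data = sim, family = binomial)
summary(fit_glm)$coefficients
```

In both coefficient tables the first factor level (Cybersecurity) is the reference, so each remaining row compares one bootcamp against it; the last column holds the p-value used to judge significance.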