I'm working on a dataset that looks like this.
#Dataframe
df=data.frame(Type=c(1,2,4,5,4,3,3,4,5,1,2,3,2,1,2,3,3,2,1,1,NA),
Q1=c(1,2,6,8,9,10,2,6,7,4,9,9,1,2,NA,4,3,8,7,6,4),
Q2=c(1,2,4,NA,8,2,1,2,10,7,5,5,5,8,2,7,4,8,7,5,1))
Context
The dataframe consists of results from a questionnaire.
The first column, Type
, refers to the type of employee who answered the questionnaire, where 1 = 'Worker
', 2 = 'Factory Lead
', 3 = 'Administrative Staff
', 4 = 'Middle Management
' & 5 = 'Executive
'
The second and third column (Q1
& Q2
) are questions, rated on a scale of 1 = 'Strongly Agree
' to 10 (Strongly Disagree
).
What I'm Trying to Achieve
I'd like to compute the total number of responses per Type
, based on score.
I have created bins for the scores, and they are -
1) Low
Agreement - Scores from 0 to 4
2) Medium
Agreement - Scores of 5 or 6
3) High
Agreement - Scores of 7 or 8
4) Very High
Agreement - Scores of 9 or 10
So I'd like to count the number of responses per worker per score bin.
My Attempt
library(dplyr)
library(tidyr)
result=df %>%
gather(Item,response,-1) %>%
filter(!is.na(response)) %>%
group_by(Type,Item) %>%
filter(!is.na(Type)) %>%
summarise(Low=sum(response %in% c(0,1,2,3,4)),
Medium=sum(response %in% c(5,6)),
High=sum(response %in% c(7,8)),
VHigh=sum(response %in% c(9,10)) %>%
spread(Type,-Item)
My logic is I use tidyr
library and first gather
scores to compute the total responses. Then spread the columns, so that I have the sub-totals by worker and by score category.
For example, for Q1, a column for total responses for Low-Worker
, then Medium-Worker
, then High-Worker
, then Very High-Worker
, then Low-Factory Lead
, then Medium-Factory Lead
.... and so on for all combinations of employees and score categories.
Clearly something is not right in my code.
Desired Output
A dataframe with two rows (Q1
& Q2
) and 20 columns (for each employee-score combination).
Any help on this would be highly appreciated.
Create a data frame of scores
library(tidyr)
library(dplyr)
df <- data_frame(type=c(1,2,4,5,4,3,3,4,5,1,2,3,2,1,2,3,3,2,1,1,NA),
q1=c(1,2,6,8,9,10,2,6,7,4,9,9,1,2,NA,4,3,8,7,6,4),
q2=c(1,2,4,NA,8,2,1,2,10,7,5,5,5,8,2,7,4,8,7,5,1))
scores <- data_frame(score = 0:10,
scorebin = c(rep("Low", 5),
rep("Medium", 2),
rep("High", 2),
rep("Very High", 2)))
Gather the data in long format. Join the scores data frame to add a scorebin
column. Group by item
, type
and scorebin
and count the number of answers under each group.
df2 <- df %>%
gather(item, score, -type) %>%
left_join(scores, by = "score") %>%
group_by(item, type, scorebin) %>%
summarise(n = n()) %>%
unite(employeescore, type, scorebin)
Change employeescore
to a factor with ordered levels
so that they are not displayed in alphabetical order (high, low, medium)
but in the correct order (low, medium, high).
employeescoreorder <- scores %>%
distinct(scorebin) %>%
merge(distinct(df, type)) %>%
unite(employeescore, type, scorebin)
df2$employeescore <- factor(df2$employeescore,
levels = employeescoreorder$employeescore)
Spread the data frame in wide format to obtain 20 columns.
df2 %>%
spread(employeescore, n)
# A tibble: 2 x 20
# Groups: item [2]
item `1_Low` `1_Medium` `1_High` `2_Low` `2_Medium` `2_High` `2_Very High` `4_Low`
* <chr> <int> <int> <int> <int> <int> <int> <int> <int>
1 q1 3 1 1 2 NA 1 1 NA
2 q2 1 1 3 2 2 1 NA 2
# ... with 11 more variables: `4_Medium` <int>, `4_High` <int>, `4_Very High` <int>,
# `5_High` <int>, `5_Very High` <int>, `3_Low` <int>, `3_Medium` <int>, `3_High` <int>,
# `3_Very High` <int>, NA_Low <int>, `<NA>` <int>