R tidyr spread columns across all categories of a given variable


I'm working on a dataset that looks like this.

#Dataframe
df=data.frame(Type=c(1,2,4,5,4,3,3,4,5,1,2,3,2,1,2,3,3,2,1,1,NA),
          Q1=c(1,2,6,8,9,10,2,6,7,4,9,9,1,2,NA,4,3,8,7,6,4),
          Q2=c(1,2,4,NA,8,2,1,2,10,7,5,5,5,8,2,7,4,8,7,5,1))

Context

The dataframe consists of results from a questionnaire.

The first column, Type, refers to the type of employee who answered the questionnaire, where 1 = 'Worker', 2 = 'Factory Lead', 3 = 'Administrative Staff', 4 = 'Middle Management' and 5 = 'Executive'.

The second and third columns (Q1 & Q2) are questions, rated on a scale from 1 = 'Strongly Agree' to 10 = 'Strongly Disagree'.

What I'm Trying to Achieve

I'd like to count the number of responses per Type, broken down by score. I have created the following bins for the scores:

1) Low Agreement - Scores from 0 to 4

2) Medium Agreement - Scores of 5 or 6

3) High Agreement - Scores of 7 or 8

4) Very High Agreement - Scores of 9 or 10

So I'd like to count the number of responses per employee type per score bin.
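
As a minimal sketch of the binning described above, base R's `cut()` can map a score to its bin. The vector `response` and its values are made up for illustration and are not part of the dataset; the breaks assume integer scores, so that (4,6] means 5 or 6.

```r
# Illustrative scores covering every bin boundary (not taken from the dataset)
response <- c(0, 4, 5, 6, 7, 8, 9, 10, NA)

# cut() uses right-closed intervals by default:
# (-Inf,4] = Low, (4,6] = Medium, (6,8] = High, (8,10] = Very High
score_bin <- cut(response,
                 breaks = c(-Inf, 4, 6, 8, 10),
                 labels = c("Low", "Medium", "High", "Very High"))

# Count the scores in each bin, keeping missing values visible
table(score_bin, useNA = "ifany")
```

Missing scores stay `NA` after binning, so they can be filtered out or counted separately.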

My Attempt

library(dplyr)
library(tidyr)

result=df %>%
gather(Item,response,-1) %>%
filter(!is.na(response)) %>%
group_by(Type,Item) %>%
filter(!is.na(Type)) %>%
summarise(Low=sum(response %in% c(0,1,2,3,4)),
        Medium=sum(response %in% c(5,6)),
        High=sum(response %in% c(7,8)),
        VHigh=sum(response %in% c(9,10)) %>%
spread(Type,-Item)

My logic is to use the tidyr library to first gather the scores into long format and compute the total responses, then spread the columns so that I have the subtotals by employee type and by score category.

For example, for Q1, a column for total responses for Low-Worker, then Medium-Worker, then High-Worker, then Very High-Worker, then Low-Factory Lead, then Medium-Factory Lead, and so on for all combinations of employee types and score categories.

Clearly something is not right in my code.

Desired Output

A dataframe with two rows (Q1 & Q2) and 20 columns (one for each employee-score combination).

Any help on this would be highly appreciated.


Solution

  • Create a data frame of scores

    library(tidyr)
    library(dplyr)
    # tibble() replaces the deprecated data_frame()
    df <- tibble(type=c(1,2,4,5,4,3,3,4,5,1,2,3,2,1,2,3,3,2,1,1,NA),
                 q1=c(1,2,6,8,9,10,2,6,7,4,9,9,1,2,NA,4,3,8,7,6,4),
                 q2=c(1,2,4,NA,8,2,1,2,10,7,5,5,5,8,2,7,4,8,7,5,1))
    
    # Lookup table mapping each score (0 to 10) to its bin
    scores <- tibble(score = 0:10,
                     scorebin = c(rep("Low", 5),
                                  rep("Medium", 2),
                                  rep("High", 2),
                                  rep("Very High", 2)))
    

    Gather the data into long format. Join the scores data frame to add a scorebin column. Group by item, type and scorebin, and count the number of answers in each group. Finally, unite type and scorebin into a single employeescore column.

    df2 <- df %>%
        gather(item, score, -type) %>% 
        left_join(scores, by = "score") %>% 
        group_by(item, type, scorebin) %>% 
        summarise(n = n()) %>% 
        unite(employeescore, type, scorebin)
    

    Convert employeescore to a factor with explicitly ordered levels, so that the columns are displayed not in alphabetical order (High, Low, Medium) but in the intended order (Low, Medium, High).

    employeescoreorder <- scores %>% 
        distinct(scorebin) %>% 
        merge(distinct(df, type)) %>% 
        unite(employeescore, type, scorebin)
    df2$employeescore <- factor(df2$employeescore, 
                                levels = employeescoreorder$employeescore)
    

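    Spread the data frame into wide format. The result has 20 columns: item, plus one column for each employee-score combination that occurs in the data. Missing values were not filtered out, so they appear as the NA_Low and `<NA>` columns.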

    df2 %>% 
        spread(employeescore, n)
    
    # A tibble: 2 x 20
    # Groups:   item [2]
       item `1_Low` `1_Medium` `1_High` `2_Low` `2_Medium` `2_High` `2_Very High` `4_Low`
    * <chr>   <int>      <int>    <int>   <int>      <int>    <int>         <int>   <int>
    1    q1       3          1        1       2         NA        1             1      NA
    2    q2       1          1        3       2          2        1            NA       2
    # ... with 11 more variables: `4_Medium` <int>, `4_High` <int>, `4_Very High` <int>,
    #   `5_High` <int>, `5_Very High` <int>, `3_Low` <int>, `3_Medium` <int>, `3_High` <int>,
    #   `3_Very High` <int>, NA_Low <int>, `<NA>` <int>
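
The wide table above has columns only for combinations that occur in the data, and it still contains the rows with missing values. The following is a sketch of one way to get exactly the 20 employee-score columns asked for, not the original answer's method. It assumes tidyr 1.0.0 or later (for `pivot_longer()` and `pivot_wider()`), swaps the lookup-table join for `cut()`, and uses `complete()` to fill combinations that never occur with 0. The names `bins` and `wide` are introduced here for illustration.

```r
library(dplyr)
library(tidyr)

df <- tibble(type = c(1,2,4,5,4,3,3,4,5,1,2,3,2,1,2,3,3,2,1,1,NA),
             q1 = c(1,2,6,8,9,10,2,6,7,4,9,9,1,2,NA,4,3,8,7,6,4),
             q2 = c(1,2,4,NA,8,2,1,2,10,7,5,5,5,8,2,7,4,8,7,5,1))

bins <- c("Low", "Medium", "High", "Very High")

wide <- df %>%
    # One row per (respondent, question)
    pivot_longer(c(q1, q2), names_to = "item", values_to = "score") %>%
    # Drop missing employee types and missing answers
    filter(!is.na(type), !is.na(score)) %>%
    # Bin the scores; factors fix the column order and the full set of levels
    mutate(scorebin = cut(score, breaks = c(-Inf, 4, 6, 8, 10), labels = bins),
           type = factor(type, levels = 1:5)) %>%
    count(item, type, scorebin) %>%
    # Add the combinations that never occur, with a count of 0
    complete(item, type, scorebin, fill = list(n = 0L)) %>%
    arrange(item, type, scorebin) %>%
    # One column per type_scorebin pair, e.g. `1_Low`, `1_Medium`, ...
    pivot_wider(names_from = c(type, scorebin), values_from = n)

dim(wide)
```

The result has two rows and 21 columns: item plus the 20 combinations, ordered by employee type and then by bin. To label the columns with employee names instead of codes, the `levels` and `labels` arguments of `factor()` could be used in the `mutate()` step.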