Search code examples
rdplyrfrequencydata-analysis

Analysing Multiple Choice Values in R - Getting Frequency count of Variables


I imported the following data in which one of the questions was structured like this:

Question a) Type_of_input [MULTIPLE SELECT]

  1. 1:Fertiliser
  2. 2:Pesticide
  3. 3:Herbicide
  4. 4:Sprayer

Question b) Total Volume.

In the data.frame, the data is split in a matrix with each of the options being in separate columns and the observations being 0 and 1. 1 is if the option was selected and 0 if the option wasn't. see below a mock up of the data.frame.

Type_of_input <- c("1:Fertiliser|2:Pesticide|4:Sprayer", "2:Pesticide|3:Herbicides", "2:Pesticide|3:Herbicide|4:Sprayer")
Fertiliser <- c(1,0,0)
Pesticide <- c(1,1,1)
Herbicide <- c(0,1,1)
Sprayer <- c(1,0,1)
total_volume <- c(40,50,60)
df_inputs <- data.frame(Type_of_input, Fertiliser, Pesticide, Herbicide, Sprayer, volume)

df_inputs

                       Type_of_input Fertiliser Pesticide Herbicide Sprayer total_volume
1 1:Fertiliser|2:Pesticide|4:Sprayer          1         1         0       1           40
2           2:Pesticide|3:Herbicides          0         1         1       0           50
3  2:Pesticide|3:Herbicide|4:Sprayer          0         1         1       1           60

How do I get a frequency table count of each of the inputs and their total_volume?


Solution

  • This is one solution, where you calculate the sum for each column of interest

    Type_of_input <- c("1:Fertiliser|2:Pesticide|4:Sprayer", "2:Pesticide|3:Herbicides", "2:Pesticide|3:Herbicide|4:Sprayer")
    Fertiliser <- c(1,0,0)
    Pesticide <- c(1,1,1)
    Herbicide <- c(0,1,1)
    Sprayer <- c(1,0,1)
    df_inputs <- data.frame(Type_of_input, Fertiliser, Pesticide, Herbicide, Sprayer)
    
    library(dplyr)
    
    df_inputs %>%
      select(-Type_of_input) %>%
      summarise_all(sum)
    
    #    Fertiliser Pesticide Herbicide Sprayer
    # 1          1         3         2       2
    

    You can have a different format like this

    library(tidyverse)
    
    df_inputs %>%
      select(-Type_of_input) %>%
      summarise_all(sum) %>%
      gather(var, value) %>%
      arrange(desc(value))
    
    #          var value
    # 1  Pesticide     3
    # 2  Herbicide     2
    # 3    Sprayer     2
    # 4 Fertiliser     1
    

    in case you want to use value variable to arrange your dataset and have the most popular values on top.