Search code examples
rcharactermultiple-columnsdummy-variable

R - Finding and combining multiple character columns


In short, I kind of want to do the reverse process of model.matrix(). I have the following data obtained from a survey.

Question 1: pick one from 'Cat', 'Dog' or 'Sheep'
Question 2: pick a random number
Question 3: pick one from 'Big' or 'Small'

The results are put in a dataframe in the following way:

    id   ans_1   ans_2   ans_3   num    size_1   size_2
    1    Cat     0       0       0.76   0        Small 
    2    0       Dog     0       0.44   0        Small 
    3    0       0       Sheep   0.52   Big      0     
    4    0       0       Sheep   0.52   Big      0     
    5    0       Dog     0       0.59   0        Small 
    6    Cat     0       0       0.97   0        Small 
    7    0       Dog     0       0.5    0        Small 
    8    0       Dog     0       0.19   0        Small 
    9    0       0       Sheep   0.01   Big      0     
   10    Cat     0       0       0.24   0        Small 

What I want to do is to combine the answers of every question in a single column. In this example, ans_1, ans_2 and ans_3 are from the same question and size_1 and size_2 as well. Combining them, the results would look like:

   id    ans     num    size  
    1    Cat     0.76   Small        
    2    Dog     0.44   Small        
    3    Sheep   0.52   Big          
    4    Sheep   0.52   Big          
    5    Dog     0.59   Small        
    6    Cat     0.97   Small        
    7    Dog     0.5    Small        
    8    Dog     0.19   Small       
    9    Sheep   0.01   Big         
   10    Cat     0.24   Small        

I have multiple datasets, each of which contains around 100 columns. This makes it too much work to do by hand. Note that there are also columns like 'num', which stand by themselves. The answers to the same question are always right next to each other like in this example.

Thank you!

Data

id = 1:10
ans_1 = c('Cat', 0, 0, 0, 0, 'Cat', 0, 0, 0, 'Cat')
ans_2 = c(0, 'Dog', 0, 0, 'Dog', 0, 'Dog', 'Dog', 0, 0)
ans_3 = c(0, 0, 'Sheep', 'Sheep', 0, 0, 0, 0, 'Sheep', 0)
num = round(runif(10),2)
size_1 = c(0, 0, 'Big', 'Big', 0, 0, 0, 0, 'Big', 0)
size_2 = c('Small', 'Small', 0, 0, 'Small', 'Small', 'Small', 'Small', 0, 
'Small')

data <- noquote(cbind(id, ans_1, ans_2, ans_3, num, size_1, size_2))

Solution

  • You might like to use unite() from tidyr, and then mutate_if() from dplyr:

    library(dplyr)
    library(tidyr)
    library(stringr)
    
    data <- data_frame(id, ans_1, ans_2, ans_3, num, size_1, size_2)
    
    data %>% 
        unite(ans, ans_1, ans_2, ans_3) %>% 
        unite(size, size_1, size_2) %>%
        mutate_if(is.character, str_extract, "[a-zA-Z]+")
    
    #> # A tibble: 10 × 4
    #>       id   ans   num  size
    #>    <int> <chr> <dbl> <chr>
    #> 1      1   Cat  0.92 Small
    #> 2      2   Dog  0.79 Small
    #> 3      3 Sheep  0.44   Big
    #> 4      4 Sheep  0.67   Big
    #> 5      5   Dog  0.00 Small
    #> 6      6   Cat  0.61 Small
    #> 7      7   Dog  0.67 Small
    #> 8      8   Dog  0.95 Small
    #> 9      9 Sheep  0.18   Big
    #> 10    10   Cat  0.76 Small