Search code examples
rregexdplyrtidyversereshape2

Split Column, convert resulting columns to factors


My goal is to split a charcter column consisting of 1's and 0's into their own columns. I want the new columns to be of type "factor". I have found a nice way to split the column (using dplyr's "mutate" and reshape2's "colsplit), but have not been able to find an efficent way to have all the resulting columns be of type "factor".

Here is a minimal example of my problem:

library(dplyr)
library(reshape2)
# Data frame to be processed
df = tribble(
  ~x, ~y, ~z,
  "Alpha", "1111", "Alp",
  "Beta", "1001", "Bet"
)

# Vector Containing Names for columns
names = c("A", "B", "C", "D")

df %>%
  mutate_at("y", colsplit, names = names, pattern = "") 

Output:

# A tibble: 2 x 3
  x       y$A    $B    $C    $D z    
  <chr> <int> <int> <int> <int> <chr>
1 Alpha     1     1     1     1 Alp  
2 Beta      1     0     0     1 Bet 

This example results in the proper table, except that I want the new columns to be factors (needed for a downsteam script). I would also prefer that the new columns not be named with information about the old column, which I believe is a feature of colsplit -- referring to the y$ and the $ portions.

Currently, the only way I can make all the columns be factors is to manually edit them, which is rather inefficent.


Other Solution I tried:

I have also tried to solve this problem using separate from the tidyverse, but could not get it to divide properly. I do not understand what regex it is using. For example, this code:

df %>%
  separate("y", into = names, sep = "")

Results in this:

# A tibble: 2 x 6
  x     A     B     C     D     z    
  <chr> <chr> <chr> <chr> <chr> <chr>
1 Alpha ""    1     1     1     Alp  
2 Beta  ""    1     0     0     Bet  

Which appears to be selecting the first blank before the string? Im not quite sure what is happening with that.

Also, In reality, the character string in column y can be a different length (but will be consistent size within the dataset being processed -- for example, y could be 100 characters long, and it would be 100 long in every row).


Solution

  • Here is a way using dplyr & tidyr

    library(dplyr)
    library(tidyr)
    
    # Create a names vector that dynamic base on length of y index from 1 to max length
    # As "" feed to separate as separator so the first matched is an empty char
    # for this case we add a column to be dropped later into the names list.
    names <- c("drop", seq_len(max(sapply(df$y, nchar))))
    
    df %>%
      separate("y", into = names, sep = "", fill = "warn") %>%
      mutate_if(.predicate = is.character, .funs = factor) %>%
      select(-drop)
    #> # A tibble: 2 x 6
    #>   x     `1`   `2`   `3`   `4`   z    
    #>   <fct> <fct> <fct> <fct> <fct> <fct>
    #> 1 Alpha 1     1     1     1     Alp  
    #> 2 Beta  1     0     0     1     Bet
    

    Or another way that only convert columns appear in names to factor

    df %>%
      separate("y", into = names, sep = "") %>%
      mutate_at(vars(one_of(names)), .funs = factor) %>%
      select(-drop)
    #> # A tibble: 2 x 6
    #>   x     `1`   `2`   `3`   `4`   z    
    #>   <chr> <fct> <fct> <fct> <fct> <chr>
    #> 1 Alpha 1     1     1     1     Alp  
    #> 2 Beta  1     0     0     1     Bet
    

    Created on 2021-05-15 by the reprex package (v2.0.0)