Search code examples
rsplitsentence

How can I split sentence into new variables in R (with zero-one encoding)?


I have a data like below:

V1  V2
1   orange, apple
2   orange, lemon
3   lemon, apple
4   orange, lemon, apple
5   lemon
6   apple
7   orange
8   lemon, apple

I want to split the V2 variable like this:

  • I have three categories of the V2 column: "orange", "lemon", "apple"
  • for each of the categories I want to create a new column (variable) that will inform about whether such a name appeared in V2 (0,1)

I tried this

df %>% separate(V2, into = c("orange", "lemon", "apple"))

.. and I got this result, but it's not what I expect.

  V1 orange lemon apple
1  1   orange   apple    <NA>
2  2   orange   lemon    <NA>
3  3    lemon   apple    <NA>
4  4   orange   lemon   apple
5  5    lemon    <NA>    <NA>
6  6    apple    <NA>    <NA>
7  7   orange    <NA>    <NA>
8  8    lemon   apple    <NA>

The result I mean is below.

V1  orange  lemon   apple
1   1   0   1
2   1   1   0
3   0   1   1
4   1   1   0
5   0   1   0
6   0   0   1
7   1   0   0
8   0   1   1

Solution

  • you could try pivoting:

    library(dplyr)
    library(tidyr)
    df |> 
      separate_rows(V2, sep = ", ") |> 
      mutate(ind = 1) |> 
      pivot_wider(names_from = V2,
                  values_from = ind,
                  values_fill = 0)
    
    

    Output is:

    # A tibble: 8 × 4
         V1 orange apple lemon
      <int>  <dbl> <dbl> <dbl>
    1     1      1     1     0
    2     2      1     0     1
    3     3      0     1     1
    4     4      1     1     1
    5     5      0     0     1
    6     6      0     1     0
    7     7      1     0     0
    8     8      0     1     1
    

    data I used:

    V1 <- 1:8
    V2 <- c("orange, apple", "orange, lemon", 
            "lemon, apple", "orange, lemon, apple",
            "lemon", "apple", "orange", 
            "lemon, apple")
    df <- tibble(V1, V2)