Search code examples
rsplitstringr

Splitting a numeric string column into a series of indicator columns


I am working on a data analysis project where I was given a very messy data set. To contextualize, the data is for mosquito surveillance. One of my columns is a series of numbers seperated by " ". Each of the numbers in these strings represent a different type of container that was tested for the presence of absence of mosquitos. For example, these are 4 examples of entries in this column, "1 3 4 5", "1 2 5 888", "1 888", and "2 3 888". There are 6 different numbers used throughout this column (1, 2, 3, 4, 5, and 888). I do not want to enlogate my dataset any further so I am hoping to create 6 different binary indicator columns to mark the presence or absence of each of these container types for each entry. I am fairly new to R so any suggestions or tips that you might have would be greatly appreciated!

For reference, this is the closest that I have gotten to what I'm looking for:

HHAnalysis$container_sondeo_sites <- str_split(HHAnalysis$container_sondeo_sites, " ", 6, TRUE)

However, because the numbers do not follow the same order for each entry, I have numbers in the wrong column. For example, I have 5's in my 2 column, 888's in my 1 column, etc. I appologize if my explaination is confusing. This is my frist entry and I am trying to figure out how to best convey my problem. Thanks in advance!


Solution

  • Not exactly sure what you need, but this might get you going:

    library(tidyr)
    df %>%
      # split `string` into separate values:
      separate_rows(string) %>%
      # from these values create new columns each:
        pivot_wider(names_from = string, values_from = string,
                    # create binary indicators:
                    values_fn = function(x) 1, values_fill = 0)
    # A tibble: 4 × 7
      someVar   `1`   `3`   `4`   `5`   `2` `888`
      <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
    1 A           1     1     1     1     0     0
    2 B           1     0     0     1     1     1
    3 C           1     0     0     0     0     1
    4 D           0     1     0     0     1     1
    

    Data:

    df <- data.frame(
      someVar = LETTERS[1:4],
      string = c("1 3 4 5", "1 2 5 888", "1 888", "2 3 888")
    )