I am working on a data analysis project where I was given a very messy data set. For context, the data is for mosquito surveillance. One of my columns is a series of numbers separated by spaces (" "). Each number in these strings represents a different type of container that was tested for the presence or absence of mosquitoes. For example, these are 4 entries from this column: "1 3 4 5", "1 2 5 888", "1 888", and "2 3 888". There are 6 different numbers used throughout this column (1, 2, 3, 4, 5, and 888). I do not want to elongate my dataset any further, so I am hoping to create 6 binary indicator columns that mark the presence or absence of each container type for each entry. I am fairly new to R, so any suggestions or tips that you might have would be greatly appreciated!
For reference, this is the closest that I have gotten to what I'm looking for:
HHAnalysis$container_sondeo_sites <- str_split(HHAnalysis$container_sondeo_sites, " ", 6, TRUE)
However, because the numbers do not follow the same order for each entry, I end up with numbers in the wrong column. For example, I have 5's in my 2 column, 888's in my 1 column, etc. I apologize if my explanation is confusing. This is my first post and I am trying to figure out how best to convey my problem. Thanks in advance!
Not exactly sure what you need, but this might get you going:
library(tidyr)
df %>%
  # split `string` into separate values:
  separate_rows(string) %>%
  # from these values create new columns each:
  pivot_wider(names_from = string, values_from = string,
              # create binary indicators:
              values_fn = function(x) 1, values_fill = 0)
# A tibble: 4 × 7
  someVar   `1`   `3`   `4`   `5`   `2` `888`
  <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A           1     1     1     1     0     0
2 B           1     0     0     1     1     1
3 C           1     0     0     0     0     1
4 D           0     1     0     0     1     1
Data:
df <- data.frame(
someVar = LETTERS[1:4],
string = c("1 3 4 5", "1 2 5 888", "1 888", "2 3 888")
)
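If you would rather have the six columns in a fixed order (1, 2, 3, 4, 5, 888) and keep the original column, here is a minimal base R sketch of the same idea. It assumes the column is a character vector with values separated by single spaces, as in your examples. The small `HHAnalysis` data frame below is only a stand-in built from those four example entries, and the `container_` prefix for the new column names is my own choice, not something from your data.

```r
# Stand-in for the real data, built from the four example entries (assumption)
HHAnalysis <- data.frame(
  container_sondeo_sites = c("1 3 4 5", "1 2 5 888", "1 888", "2 3 888"),
  stringsAsFactors = FALSE
)

# The six container codes, in the order the indicator columns should appear
codes <- c("1", "2", "3", "4", "5", "888")

# Split each entry on single spaces; the result is a list of character vectors
tokens <- strsplit(HHAnalysis$container_sondeo_sites, " ", fixed = TRUE)

# For each entry, check which codes are present (1) or absent (0);
# vapply() returns codes in rows, so transpose to get one row per entry
indicators <- t(vapply(
  tokens,
  function(x) as.integer(codes %in% x),
  integer(length(codes))
))

# Hypothetical column names: container_1, ..., container_888
colnames(indicators) <- paste0("container_", codes)

# Append the indicator columns to the original data frame
HHAnalysis <- cbind(HHAnalysis, indicators)
```

Because each indicator is computed with `%in%` rather than by position, the order of the numbers within an entry does not matter, which is what went wrong with the `str_split()` attempt.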