I need to clean data where the variable property and answer associated with a location are together in a single cell. The only thing consistent in my dataset is that they are separated by a colon (:). I need to remap the data to the variable property becomes the column header and the data is mapped for each Location.
I've attached an example:
There can also be a bunch of other symbols that are irrelevant. I just need to extract the string before the colon and the string or integer after the colon and it is mapped correctly for each location.
How do I do this on R? What functions should I be using
Example data:
Example1 Sunny:"TRUE" NearCoast:False Schools:{"13"} 2
Example2 NearCoast:False Schools:{"6"} Sunny:"FALSE" 3
Example3 Schools:{"2"} Sunny:"TRUE" NearCoast:TRUE Transport:5
Also, would it be possible that I could add exceptions to this process. For example, if the cell is simply a number alone, it is ignored. Or, if the property name is a specific thing such as "transport", it ignores the cell too.
Try this example, as mentioned in comments, we can reshape wide-to-long, then string split on :
, then again reshape long-to-wide.
df1 <- read.table(text = '
Example1 Sunny:"TRUE" NearCoast:False Schools:{"13"} 2
Example2 NearCoast:False Schools:{"6"} Sunny:"FALSE" 3
Example3 Schools:{"2"} Sunny:"TRUE" NearCoast:TRUE Transport:5',
header = FALSE, stringsAsFactors = FALSE)
library(tidyverse)
gather(df1, key = "k", value = "v", -V1) %>%
separate(v, into = c("type", "value"), sep = ":") %>%
filter(!is.na(value)) %>%
select(-k) %>%
spread(key = type, value = value)
# V1 NearCoast Schools Sunny Transport
# 1 Example1 False {"13"} "TRUE" <NA>
# 2 Example2 False {"6"} "FALSE" <NA>
# 3 Example3 TRUE {"2"} "TRUE" 5