I am working on HCUP data and this has range of values in one single column that needs to be split into multiple columns. Below is the HCUP data frame for reference :
code label
61000-61003 excision of CNS
0169T-0169T ventricular shunt
The desired output should be :
code label
61000 excision of CNS
61001 excision of CNS
61002 excision of CNS
61003 excision of CNS
0169T ventricular shunt
My approach to this problem is using the package splitstackshape and using this code
library(data.table)
library(splitstackshape)
cSplit(hcup, "code", "-")[, list(code = code_1:code_2, by = label)]
This approach leads to memory issues. Is there a better approach to this problem?
Some comments :
Here's a solution using dplyr
and all.is.numeric
from Hmisc
:
library(dplyr)
library(Hmisc)
library(tidyr)
dat %>% separate(code, into=c("code1", "code2")) %>%
rowwise %>%
mutate(lists = ifelse(all.is.numeric(c(code1, code2)),
list(as.character(seq(from = as.numeric(code1), to = as.numeric(code2)))),
list(code1))) %>%
unnest(lists) %>%
select(code = lists, label)
Source: local data frame [5 x 2]
code label
(chr) (fctr)
1 61000 excision of CNS
2 61001 excision of CNS
3 61002 excision of CNS
4 61003 excision of CNS
5 0169T ventricular shunt
An edit to fix ranges with character values. Brings down the simplicity a little:
dff %>% mutate(row = row_number()) %>%
separate(code, into=c("code1", "code2")) %>%
group_by(row) %>%
summarise(lists = if(all.is.numeric(c(code1, code2)))
{list(str_pad(as.character(
seq(from = as.numeric(code1), to = as.numeric(code2))),
nchar(code1), pad="0"))}
else if(grepl("^[0-9]", code1))
{list(str_pad(paste0(as.character(
seq(from = extract_numeric(code1), to = extract_numeric(code2))),
strsplit(code1, "[0-9]+")[[1]][2]),
nchar(code1), pad = "0"))}
else
{list(paste0(
strsplit(code1, "[0-9]+")[[1]],
str_pad(as.character(
seq(from = extract_numeric(code1), to = extract_numeric(code2))),
nchar(gsub("[^0-9]", "", code1)), pad="0")))},
label = first(label)) %>%
unnest(lists) %>%
select(-row)
Source: local data frame [15 x 2]
label lists
(chr) (chr)
1 excision of CNS 61000
2 excision of CNS 61001
3 excision of CNS 61002
4 ventricular shunt 0169T
5 ventricular shunt 0170T
6 ventricular shunt 0171T
7 excision of CNS 01000
8 excision of CNS 01001
9 excision of CNS 01002
10 some procedure A2543
11 some procedure A2544
12 some procedure A2545
13 some procedure A0543
14 some procedure A0544
15 some procedure A0545
data:
dff <- structure(list(code = c("61000-61002", "0169T-0171T", "01000-01002",
"A2543-A2545", "A0543-A0545"), label = c("excision of CNS", "ventricular shunt",
"excision of CNS", "some procedure", "some procedure")), .Names = c("code",
"label"), row.names = c(NA, 5L), class = "data.frame")