r data.table medical data-cleaning splitstackshape

Splitting a single column into multiple observation using R

I am working on HCUP data and this has range of values in one single column that needs to be split into multiple columns. Below is the HCUP data frame for reference :

code            label
61000-61003     excision of CNS
0169T-0169T     ventricular shunt

The desired output should be :

code            label
61000           excision of CNS
61001           excision of CNS
61002           excision of CNS
61003           excision of CNS
0169T           ventricular shunt

My approach to this problem is using the package splitstackshape and using this code

library(data.table)
library(splitstackshape)

cSplit(hcup, "code", "-")[, list(code = code_1:code_2, by = label)]

This approach leads to memory issues. Is there a better approach to this problem?

Some comments :

The data has many letters apart from "T".
The letter can be either in the front or at the very end but not in between two numbers.
There is no change of letter from "T" to "U" in one single range

Solution

Here's a solution using dplyr and all.is.numeric from Hmisc:

library(dplyr)
library(Hmisc)
library(tidyr)
dat %>% separate(code, into=c("code1", "code2")) %>%
        rowwise %>%
        mutate(lists = ifelse(all.is.numeric(c(code1, code2)),
                         list(as.character(seq(from = as.numeric(code1), to = as.numeric(code2)))),
                         list(code1))) %>%
        unnest(lists) %>%
        select(code = lists, label)

Source: local data frame [5 x 2]

   code             label
  (chr)            (fctr)
1 61000   excision of CNS
2 61001   excision of CNS
3 61002   excision of CNS
4 61003   excision of CNS
5 0169T ventricular shunt

An edit to fix ranges with character values. Brings down the simplicity a little:

dff %>% mutate(row = row_number()) %>%
        separate(code, into=c("code1", "code2")) %>%
        group_by(row) %>%
        summarise(lists = if(all.is.numeric(c(code1, code2)))
                              {list(str_pad(as.character(
                                   seq(from = as.numeric(code1), to = as.numeric(code2))),
                                         nchar(code1), pad="0"))}
                          else if(grepl("^[0-9]", code1))
                              {list(str_pad(paste0(as.character(
                                   seq(from = extract_numeric(code1), to = extract_numeric(code2))),
                                      strsplit(code1, "[0-9]+")[[1]][2]),
                                         nchar(code1), pad = "0"))}
                          else
                              {list(paste0(
                                      strsplit(code1, "[0-9]+")[[1]],
                                      str_pad(as.character(
                                    seq(from = extract_numeric(code1), to = extract_numeric(code2))),
                                         nchar(gsub("[^0-9]", "", code1)), pad="0")))},
                   label = first(label)) %>%
        unnest(lists) %>%
        select(-row)
Source: local data frame [15 x 2]

               label lists
               (chr) (chr)
1    excision of CNS 61000
2    excision of CNS 61001
3    excision of CNS 61002
4  ventricular shunt 0169T
5  ventricular shunt 0170T
6  ventricular shunt 0171T
7    excision of CNS 01000
8    excision of CNS 01001
9    excision of CNS 01002
10    some procedure A2543
11    some procedure A2544
12    some procedure A2545
13    some procedure A0543
14    some procedure A0544
15    some procedure A0545

data:

dff <- structure(list(code = c("61000-61002", "0169T-0171T", "01000-01002", 
"A2543-A2545", "A0543-A0545"), label = c("excision of CNS", "ventricular shunt", 
"excision of CNS", "some procedure", "some procedure")), .Names = c("code", 
"label"), row.names = c(NA, 5L), class = "data.frame")