Search code examples
rdata.tablemedicaldata-cleaningsplitstackshape

Splitting a single column into multiple observation using R


I am working on HCUP data and this has range of values in one single column that needs to be split into multiple columns. Below is the HCUP data frame for reference :

code            label
61000-61003     excision of CNS
0169T-0169T     ventricular shunt

The desired output should be :

code            label
61000           excision of CNS
61001           excision of CNS
61002           excision of CNS
61003           excision of CNS
0169T           ventricular shunt

My approach to this problem is using the package splitstackshape and using this code

library(data.table)
library(splitstackshape)

cSplit(hcup, "code", "-")[, list(code = code_1:code_2, by = label)]

This approach leads to memory issues. Is there a better approach to this problem?

Some comments :

  • The data has many letters apart from "T".
  • The letter can be either in the front or at the very end but not in between two numbers.
  • There is no change of letter from "T" to "U" in one single range

Solution

  • Here's a solution using dplyr and all.is.numeric from Hmisc:

    library(dplyr)
    library(Hmisc)
    library(tidyr)
    dat %>% separate(code, into=c("code1", "code2")) %>%
            rowwise %>%
            mutate(lists = ifelse(all.is.numeric(c(code1, code2)),
                             list(as.character(seq(from = as.numeric(code1), to = as.numeric(code2)))),
                             list(code1))) %>%
            unnest(lists) %>%
            select(code = lists, label)
    
    Source: local data frame [5 x 2]
    
       code             label
      (chr)            (fctr)
    1 61000   excision of CNS
    2 61001   excision of CNS
    3 61002   excision of CNS
    4 61003   excision of CNS
    5 0169T ventricular shunt
    

    An edit to fix ranges with character values. Brings down the simplicity a little:

    dff %>% mutate(row = row_number()) %>%
            separate(code, into=c("code1", "code2")) %>%
            group_by(row) %>%
            summarise(lists = if(all.is.numeric(c(code1, code2)))
                                  {list(str_pad(as.character(
                                       seq(from = as.numeric(code1), to = as.numeric(code2))),
                                             nchar(code1), pad="0"))}
                              else if(grepl("^[0-9]", code1))
                                  {list(str_pad(paste0(as.character(
                                       seq(from = extract_numeric(code1), to = extract_numeric(code2))),
                                          strsplit(code1, "[0-9]+")[[1]][2]),
                                             nchar(code1), pad = "0"))}
                              else
                                  {list(paste0(
                                          strsplit(code1, "[0-9]+")[[1]],
                                          str_pad(as.character(
                                        seq(from = extract_numeric(code1), to = extract_numeric(code2))),
                                             nchar(gsub("[^0-9]", "", code1)), pad="0")))},
                       label = first(label)) %>%
            unnest(lists) %>%
            select(-row)
    Source: local data frame [15 x 2]
    
                   label lists
                   (chr) (chr)
    1    excision of CNS 61000
    2    excision of CNS 61001
    3    excision of CNS 61002
    4  ventricular shunt 0169T
    5  ventricular shunt 0170T
    6  ventricular shunt 0171T
    7    excision of CNS 01000
    8    excision of CNS 01001
    9    excision of CNS 01002
    10    some procedure A2543
    11    some procedure A2544
    12    some procedure A2545
    13    some procedure A0543
    14    some procedure A0544
    15    some procedure A0545
    

    data:

    dff <- structure(list(code = c("61000-61002", "0169T-0171T", "01000-01002", 
    "A2543-A2545", "A0543-A0545"), label = c("excision of CNS", "ventricular shunt", 
    "excision of CNS", "some procedure", "some procedure")), .Names = c("code", 
    "label"), row.names = c(NA, 5L), class = "data.frame")