Search code examples
rstringrmutate

Expand a range of values given as a string to a comma separated string of values


A column of strings contains one or more numbers separated by commas or by a range (specifically, something like 5551-6 meaning 5551 through 5556). I need to expand the range to be in the same comma separated form as the other rows.

I have a work-around, but I am not happy with the complexity of it.

Data Source

I have a character column of NAICS codes from this table. Which is the sheet named "Table 2.11" of the excel file available on the linked page.

The problem can be reproduced using the following sample data:

data0 <- tibble::tibble("2022 NAICS" = c("31-33", "331", "332, 333", "6113", "6114-7", "6118"))

Problem

Some of the rows contain multiple codes (for instance, 313, 314). This is fine. I can separate these each into their own row with separate_longer_delim(NAICS, delim = ", "). One row contains 6114-7, which would be the NAICS codes 6114 through 6117. I would like to convert 6114-7 to the same form as the other rows. Since there is only one instance of the problem, I know I could explicitly replace that string, but I would like to do it dynamically.

Desired Result

Convert NAICS codes given as a range (6114-7) to comma separated values as a string (6114, 6115, 6116, 6117).

Attempt

I have a work-around, but I'm sure there is a more elegant solution. Ideally a solution that can also handle a different number of digits, such as turning 31-33 into 31, 32, 33, which also appears in the data. I removed them in the attempt below.

The thought process in the mutate below is to find any rows similar to 6114-7, and if found, paste a string which is formed by collapsing a sequence whose endpoints are extracted from the original value. If a match is not found, NAICS is set equal to itself, no changes are made.

library(tidyverse)

data1 <- data0 |>
    rename(NAICS = "2022 NAICS") |>
    filter(!is.na(NAICS), NAICS != "NA") |>
    filter(str_detect(NAICS,"(\\d\\d)-(\\d\\d)",negate = TRUE)) |>
    rowwise() |>
    mutate(
        NAICS = ifelse(
            str_detect(NAICS, "(\\d+)-(\\d)"),
            paste(
                as.numeric(
                    str_extract(NAICS, "(\\d+)", group = 1)
                ):(
                    round(
                        as.numeric(
                            str_extract(NAICS, "(\\d+)", group = 1)
                        ),
                        digits = -1
                    ) +
                        as.numeric(
                            str_extract(NAICS, "-(\\d)", group = 1)
                        )
                ),
                collapse = ", "
            ),
            NAICS
        )
    )

# using data0 provided above, data1 is:
# A tibble: 5 × 1
# Rowwise: 
#  NAICS                 
#  <chr>                 
# 1 331                   
# 2 332, 333              
# 3 6113                  
# 4 6114, 6115, 6116, 6117
# 5 6118  

Solution

  • It seems to me that if the second number in a pair separated by a - is less than 10, we can assume that it is supposed to represent a value with the same number of 10s as the first value. Otherwise, it is treated as-is. For example 975-6 would represent 975, 976 and 51-5 would represent 51, 52, 53, 54, 55, but the latter could also be represented with 51-55.

    That being the case, we can do:

    unlist(lapply(strsplit(data0$`2022 NAICS`, '-'), \(x) {
      if(length(x) == 2) {
        x <- as.numeric(x)
        if(x[2] < 10) x[2] <- 10 * floor(x[1]/10) + x[2]
        paste(seq(as.numeric(x[1]), as.numeric(x[2])), collapse = ', ')
      } else x
    }))
    #> [1] "31, 32, 33"             "331"                    "332, 333"              
    #> [4] "6113"                   "6114, 6115, 6116, 6117" "6118"