r, string, dataformat

Extracting digits from a character string and converting them to a number


Let's say I have a data frame with columns A through E, where column E looks like this:

ABCDEF50GH
ABCDEF600GH
ABCDEF50GH
ABCDEF1000GH

Part of my code looks like this:

DF <- (filter(DF1, A == "AH") %>%
         mutate(B = nchar(E),
                C = case_when(D == "X" ~ "0",
                              B == 10 ~ substr(E, 7, 8),
                              B == 11 ~ substr(E, 7, 9),
                              B == 12 ~ substr(E, 7, 10),
                              TRUE ~ "0")))

So I am trying to extract a number from a string. The problem is that the extracted number is a character, not a number, so I need to make the other branches of case_when characters too. As a result, column C is a character vector, and when I try to convert it to numeric:

transform(DF, C = as.numeric(levels(C))[C])

I get a vector with NAs instead of numbers.

Please help.


Solution

  • Use stringr to extract the digits, then convert the result to a numeric vector:

    library(dplyr)
    library(stringr)
    
    sample.df <- data.frame(E = c(
      "ABCDEF50GH",
      "ABCDEF600GH",
      "ABCDEF50GH",
      "ABCDEF1000GH"
    ), 
    stringsAsFactors = FALSE)
    
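    # str_extract_all() returns a list with one element per row;
    # unlist() flattens it, which only lines up with the rows when
    # every string contains exactly one run of digits.
    sample.df <- sample.df %>%
      mutate(E_numbers = str_extract_all(E, "[[:digit:]]+")) %>%
      mutate(E_numbers = unlist(E_numbers)) %>% 
      mutate(E_numbers = as.numeric(E_numbers))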
    
    > sample.df
                 E E_numbers
    1   ABCDEF50GH        50
    2  ABCDEF600GH       600
    3   ABCDEF50GH        50
    4 ABCDEF1000GH      1000
    

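    str_extract_all() returns a list, which can be tricky to handle, so I use unlist() to flatten it. This only works when every string contains exactly one number; otherwise the flattened vector no longer has one element per row. Other than that, it should be straightforward :)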

    Note: the difference between str_extract_all() and str_extract() is that str_extract() only catches the first number in each string. So if one of the strings in E were "ABCDEF600G400H", str_extract_all() would return both 600 and 400, while str_extract() would return only 600. I'm not sure which is preferable in your case.
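
    To make that difference concrete, here is a small sketch on a single string. The variable names (x, first_only, all_matches) are made up for illustration and are not part of the question.

    ```r
    library(stringr)

    # Illustrative input with two separate runs of digits
    x <- "ABCDEF600G400H"

    # str_extract() returns a character vector holding the first match only
    first_only <- str_extract(x, "[[:digit:]]+")

    # str_extract_all() returns a list with one character vector per input string
    all_matches <- str_extract_all(x, "[[:digit:]]+")

    first_only        # "600"
    all_matches[[1]]  # "600" "400"
    ```

    Because str_extract() already returns a plain character vector, it can go straight into as.numeric() without an unlist() step.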

    Edit: If you want to extract only the last number in a string such as "ABCDEF600G400H", you can use the stringi package instead of stringr:

    library(dplyr)
    library(stringi)
    
    sample.df <- data.frame(
      E = c(
        "ABCDEF50GH",
        "ABCDEF600GH",
        "ABCDEF50GH",
        "ABCDEF1000GH",
        "ABCDEF600G400H"
      ), stringsAsFactors = FALSE)
    
    # stri_extract_last_regex() already returns a character vector
    # (one element per row), so no unlist() step is needed.
    sample.df <- sample.df %>%
      mutate(E_numbers = stri_extract_last_regex(E, "[[:digit:]]+")) %>%
      mutate(E_numbers = as.numeric(E_numbers))
    
    > sample.df
                   E E_numbers
    1     ABCDEF50GH        50
    2    ABCDEF600GH       600
    3     ABCDEF50GH        50
    4   ABCDEF1000GH      1000
    5 ABCDEF600G400H       400
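
    To tie this back to the pipeline in the question, here is a rough sketch of how the extraction could sit inside case_when() so that C is numeric from the start. DF1 below is a made-up stand-in (the real data is not shown in the question), and falling back to 0 when no digits are found is my assumption based on the TRUE ~ "0" branch. The first lines also show why the original conversion gave NAs: as.numeric(levels(C))[C] is an idiom for factors, and a character vector has no levels.

    ```r
    library(dplyr)
    library(stringr)

    # Why the original conversion returned NAs: levels() of a character
    # vector is NULL, so as.numeric(levels(chr)) is an empty vector, and
    # indexing an empty vector by name yields NA for every element.
    chr <- c("50", "600")
    na_result <- as.numeric(levels(chr))[chr]

    # Hypothetical stand-in for DF1; only columns A, D and E matter here.
    DF1 <- data.frame(
      A = c("AH", "AH", "AH", "ZZ"),
      D = c("Y", "X", "Y", "Y"),
      E = c("ABCDEF50GH", "ABCDEF600GH", "ABCDEF1000GH", "ABCDEF50GH"),
      stringsAsFactors = FALSE
    )

    DF <- DF1 %>%
      filter(A == "AH") %>%
      mutate(
        # Every branch is numeric, so no character round trip is needed
        C = case_when(
          D == "X" ~ 0,
          TRUE ~ as.numeric(str_extract(E, "[[:digit:]]+"))
        ),
        # Assumed fallback: strings without any digits become 0
        C = coalesce(C, 0)
      )
    ```

    This takes the first run of digits in E rather than relying on fixed substr() positions, so it does not depend on the string length.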