
extract integers from characters in R


I am working in R and want to extract just the numbers from df1. For example, I have:

    df1 <- data.frame(
      column1 = c("Any[12, 15, 20]", "Any[22, 23, 30]"),
      column2 = c("Any[4, 17]", "Any[]"),
      stringsAsFactors = FALSE
    )

I want a new df that takes the integers within the brackets, multiplies them by the row number, and keeps the name of the column each one came from.

e.g. new_df could look like

Time Channel
12 column1
15 column1
20 column1
44 column1
46 column1
60 column1
8 column2
34 column2

I do not need to preserve any NA values, e.g. when Any[] is empty. Does anyone have an idea whether this is possible? I have ENORMOUS amounts of data in this format, so I cannot really do much manually. Cheers!

I already tried: new_df$Time <- as.integer(df1$column1) and that just gave blanks.

I also tried:

    new_df$Time <- str_extract_all(new_df$Time, "\\d+") %>%
      lapply(function(x) as.integer(x)) %>%
      sapply(function(x) ifelse(length(x) > 0, x, NA))

which then returned only the first integer within each bracket, e.g.

Time Channel
12 column1
44 column1
8 column2

Solution

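    library(dplyr)
    library(purrr)
    library(stringr)
    library(tidyr)
    
    df1 |>
      # extract the digits in each cell, then multiply them by the row index (.y)
      mutate(across(everything(), \(x) imap(str_extract_all(x, "\\d+"), ~ as.numeric(.x) * .y))) |>
      # stack the columns: column names go to "Time", the lists of numbers to "Channel"
      pivot_longer(everything(), cols_vary = "slowest", names_to = "Time", values_to = "Channel") |>
      # one row per number; empty lists (from "Any[]") are dropped
      unnest_longer(Channel)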
    

    How it works

    This is similar to the approach you took in your initial attempt, except that I am using purrr::imap instead of lapply. The advantage is that imap gives you access to the list element's name or, for an unnamed list like this one, its position (.y), in addition to the list element itself (.x). Here the position is the row number, which makes the multiplication step simple.
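
To illustrate what .y holds, here is a minimal sketch with made-up lists (the values 10 and 20 and the names a and b are purely illustrative, not from the question). With an unnamed list, .y is the integer position; with a named list, it is the name as a character string, so the multiplication trick only works in the unnamed case.

```r
library(purrr)

# Unnamed list: .y is the integer position, so arithmetic on it works.
by_position <- imap(list(10, 20), ~ .x * .y)

# Named list: .y is the element name (a character string),
# so .x * .y would raise an error here; paste() is used instead.
by_name <- imap(list(a = 10, b = 20), ~ paste(.y, .x))
```

str_extract_all returns an unnamed list, which is why .y can stand in for the row number in the solution.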

    str_extract_all extracts all the numbers from a column and outputs those numbers in a list:

    str_extract_all(df1$column1, "\\d+")
    [[1]]
    [1] "12" "15" "20"
    
    [[2]]
    [1] "22" "23" "30"
    

    imap iterates over this list and does the multiplication:

    imap(str_extract_all(df1$column1, "\\d+"), ~ as.numeric(.x) * .y)
    [[1]]
    [1] 12 15 20
    
    [[2]]
    [1] 44 46 60
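
If purrr and stringr are not available, the same two steps can probably be done in base R. This is only a sketch of an alternative, not part of the answer above; the variable names x, nums and scaled are illustrative.

```r
x <- c("Any[12, 15, 20]", "Any[22, 23, 30]")

# Base R counterpart of str_extract_all(): all runs of digits per string.
nums <- regmatches(x, gregexpr("[0-9]+", x))

# Base R counterpart of imap(): pair each element with its position.
scaled <- Map(function(values, row) as.numeric(values) * row,
              nums, seq_along(nums))
```

Here seq_along(nums) supplies the row numbers explicitly instead of relying on imap's .y.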
    

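    The other two steps reshape the data: pivot_longer stacks the columns into one row per original cell, with the column name alongside its list of numbers, and unnest_longer then expands each list into one row per number. Empty lists, such as the one produced by Any[], are dropped at this point.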
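
The reshaping can be followed step by step with a small sketch. It assumes the list-columns have already been built (here they are typed in by hand as the tibble nested, a name chosen for illustration) and uses the same names_to and values_to labels as the solution.

```r
library(tidyr)
library(tibble)

# List-columns as they look after the mutate() step.
nested <- tibble(
  column1 = list(c(12, 15, 20), c(44, 46, 60)),
  column2 = list(c(4, 17), numeric(0))
)

# One row per original cell: 2 rows x 2 columns = 4 rows.
long <- pivot_longer(nested, everything(),
                     names_to = "Time", values_to = "Channel")

# One row per number; the empty cell contributes no rows.
flat <- unnest_longer(long, Channel)

nrow(long)  # 4
nrow(flat)  # 8
```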

    Output

      Time    Channel
      <chr>     <dbl>
    1 column1      12
    2 column1      15
    3 column1      20
    4 column1      44
    5 column1      46
    6 column1      60
    7 column2       4
    8 column2      17
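
Note that in the output above the column names end up under Time and the numbers under Channel, whereas the question asks for the numbers in Time and the column name in Channel. A variant that matches the requested layout might look like the sketch below. It assumes the same packages and a tidyr version that supports cols_vary; swapping the labels, using as.integer and adding the final select() are my adjustments, not part of the original answer.

```r
library(dplyr)
library(purrr)
library(stringr)
library(tidyr)

df1 <- data.frame(
  column1 = c("Any[12, 15, 20]", "Any[22, 23, 30]"),
  column2 = c("Any[4, 17]", "Any[]"),
  stringsAsFactors = FALSE
)

new_df <- df1 |>
  # digits per cell, multiplied by the row position (.y)
  mutate(across(everything(),
                \(x) imap(str_extract_all(x, "\\d+"), ~ as.integer(.x) * .y))) |>
  # column names go to Channel, the lists of numbers to Time
  pivot_longer(everything(), cols_vary = "slowest",
               names_to = "Channel", values_to = "Time") |>
  # one row per number; empty cells are dropped
  unnest_longer(Time) |>
  # put Time first, as in the question
  select(Time, Channel)
```

With the example data this should give eight rows, with column2 contributing 4 and 17 because both values sit in row 1.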