I am in R. I want to extract just the numbers from df1.
I have for example:
df1 <- data.frame( column1 = c("Any[12, 15, 20]", "Any[22, 23, 30]"), column2 = c("Any[4, 17]", "Any[]"), stringsAsFactors = F )
And I want a new df, that takes the integers within the brackets muliples by the row number, and keeps the column information corresponding to it.
e.g. new_df could look like
Time | Channel |
---|---|
12 | column1 |
15 | column1 |
20 | column1 |
44 | column1 |
46 | column1 |
60 | column1 |
8 | column2 |
34 | column2 |
I do not need to preserve any "NA" values, e.g If Any[] is empty. Anyone got any idea if this is possible please? I have ENORMOUS amounts of data in this format, so I cannot really do much manually. Cheers!
I already tried:
new_df$Time <- as.integer(df1$column1)
and that just gave blanks.
I also tried:
new_df$Time <- str_extract_all(new_df$Time, "\\d+" ) %>% lapply(function(x) as.integer(x)) %>% sapply(function(x) if.else(length(x) >0, x, NA) )
which only then returned the first integer within each bracket. e.g.
Time | Channel |
---|---|
12 | column1 |
44 | column1 |
8 | column2 |
library(dplyr)
library(purrr)
library(stringr)
library(tidyr)
df1|>
mutate(across(everything(), \(x) imap(str_extract_all(x, "\\d+"), ~ as.numeric(.x) * .y))) |>
pivot_longer(everything(), cols_vary = "slowest", names_to = "Time", values_to = "Channel") |>
unnest_longer(Channel)
How it works
This is similar to the approach you took in your initial attempt except I am using purrr::imap
instead of lapply
. The advantage here is that imap
gives you access to the list element name (.y
), which is the row number in this case, in addition to the list element (.x
). This makes the multiple multiplication step simple.
str_extract_all
extracts all the numbers from a column and outputs those numbers in a list:
str_extract_all(df1$column1, "\\d+")
[[1]]
[1] "12" "15" "20"
[[2]]
[1] "22" "23" "30"
imap
iterates over this list and does the multiplication:
imap(str_extract_all(df1$column1, "\\d+"), ~ as.numeric(.x) * .y)
[[1]]
[1] 12 15 20
[[2]]
[1] 44 46 60
Then the other two pipes are to reshape the data.
Output
Time Channel
<chr> <dbl>
1 column1 12
2 column1 15
3 column1 20
4 column1 44
5 column1 46
6 column1 60
7 column2 4
8 column2 17