Suppose I have a data frame like this
df <- data.frame(x=c("This script outputs 10 visualizations.",
"This script outputs 1 visualization.",
"This script outputs 5 data files.",
"This script outputs 1 data file.",
"This script doesn't output any visualizations or data files",
"This script outputs 9 visualizations and 28 data files.",
"This script outputs 1 visualization and 1 data file."))
It looks like this
x
1 This script outputs 10 visualizations.
2 This script outputs 1 visualization.
3 This script outputs 5 data files.
4 This script outputs 1 data file.
5 This script doesn't output any visualizations or data files
6 This script outputs 9 visualizations and 28 data files.
7 This script outputs 1 visualization and 1 data file.
Is there a simple way, possibly using the Tidyverse
to extract the number of visualizations and the number of files for each row? When there are no visualizations (or no data files, or both) I would like to extract 0
. Essentially I would like the final result to be like this
viz files
1 10 0
2 1 0
3 0 5
4 0 1
5 0 0
6 9 28
7 1 1
I tried using stuff like
str_extract(df$x, "(?<=This script outputs )(.*)(?= visualizatio(n\\.$|ns\\.$))")
but I got so lost.
We can use regex lookaround in str_extract
to extract one or more digits (\\d+
) followed by a space and 'vis' or 'data files' into two columns
library(dplyr)
library(stringr)
df %>%
transmute(viz = as.numeric(str_extract(x, "\\d+(?= vis)")),
files = as.numeric(str_extract(x, "\\d+(?= data files?)"))) %>%
mutate_all(replace_na, 0)
# viz files
#1 10 0
#2 1 0
#3 0 5
#4 0 0
#5 0 0
#6 9 28
#7 1 0
In the first case, the pattern matches one or more digits (\\d+
) followed by a regex lookaround ((?=
) where there is a space followed by the 'vis' word and in second column, it extracts the digits followed by the space and the word 'file' or 'files'