Search code examples
rregextidyrregex-lookarounds

R tidyr regex: extract ordered numbers from character column


Suppose I have a data frame like this

df <- data.frame(x=c("This script outputs 10 visualizations.", 
                     "This script outputs 1 visualization.", 
                     "This script outputs 5 data files.", 
                     "This script outputs 1 data file.", 
                     "This script doesn't output any visualizations or data files", 
                     "This script outputs 9 visualizations and 28 data files.", 
                     "This script outputs 1 visualization and 1 data file."))

It looks like this

                                                            x
1                      This script outputs 10 visualizations.
2                        This script outputs 1 visualization.
3                           This script outputs 5 data files.
4                            This script outputs 1 data file.
5 This script doesn't output any visualizations or data files
6     This script outputs 9 visualizations and 28 data files.
7        This script outputs 1 visualization and 1 data file.

Is there a simple way, possibly using the Tidyverse to extract the number of visualizations and the number of files for each row? When there are no visualizations (or no data files, or both) I would like to extract 0. Essentially I would like the final result to be like this

    viz   files
1    10       0
2     1       0
3     0       5
4     0       1
5     0       0
6     9      28
7     1       1

I tried using stuff like

str_extract(df$x, "(?<=This script outputs )(.*)(?= visualizatio(n\\.$|ns\\.$))")

but I got so lost.


Solution

  • We can use regex lookaround in str_extract to extract one or more digits (\\d+) followed by a space and 'vis' or 'data files' into two columns

    library(dplyr)
    library(stringr)
    df %>% 
      transmute(viz = as.numeric(str_extract(x, "\\d+(?= vis)")),
                files = as.numeric(str_extract(x, "\\d+(?= data files?)"))) %>%
      mutate_all(replace_na, 0)
    #  viz files
    #1  10     0
    #2   1     0
    #3   0     5
    #4   0     0
    #5   0     0
    #6   9    28
    #7   1     0
    

    In the first case, the pattern matches one or more digits (\\d+) followed by a regex lookaround ((?=) where there is a space followed by the 'vis' word and in second column, it extracts the digits followed by the space and the word 'file' or 'files'