r  tidyverse  stringr  stringi

Parsing 'Table of Contents' to get the correct page numbers


Here is a table of contents:

df <- tibble(ToC=
             c("3.1 texta.............. 22",
             "3.2 textb     25",
             "section 6 ................. 50",
             "section 10.2       65"))

I want to extract the contents and their respective page numbers as two variables. I tried the following, but it returns the leading section numbers instead of the page numbers.

library(tidyverse); library(stringr)
df_toc <- df %>%
  mutate(page = as.numeric(str_extract(ToC, "[0-9]+")))

The correct page numbers should be 22, 25, 50, and 65. How should I solve this?


Solution

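  • Anchor the pattern to the end of the string. str_extract() returns only the first match, so "[0-9]+" picks up the leading section number (3, 3, 6, 10), whereas "\\d+$" matches only the digits at the end of each line: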

    df %>% 
      mutate(page = as.numeric(str_extract(ToC, "\\d+$")))
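
The question also asks for the contents as a second variable. A minimal sketch of one way to get both columns, assuming every entry ends with its page number and that the filler between the text and the page number consists only of dots and whitespace. The column name `content` and the filler pattern are my assumptions, not something given in the question:

```r
library(tidyverse)

df <- tibble(ToC = c("3.1 texta.............. 22",
                     "3.2 textb     25",
                     "section 6 ................. 50",
                     "section 10.2       65"))

df_toc <- df %>%
  mutate(
    # page number: the run of digits anchored at the end of the string
    page = as.numeric(str_extract(ToC, "\\d+$")),
    # content (hypothetical column name): strip the trailing filler
    # (dots and whitespace) together with the page number
    content = str_remove(ToC, "[.\\s]*\\d+$")
  )

df_toc
```

With the sample data this should give pages 22, 25, 50, 65 and contents "3.1 texta", "3.2 textb", "section 6", "section 10.2". If a real table of contents uses other leader characters (for example dashes or underscores), the character class in the second pattern would need to be extended.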