
Extracting <h2> title text from HTML where the title text might include newlines


I have an HTML file with some <h2> tags, such as:

a <- '<section id="sec-standard-stoet-geary" class="level2" data-number="9.4">
      <h2 data-number="9.4" class="anchored" data-anchor-id="sec-standard-stoet-geary">
      <span class="header-section-number">9.4</span> Standardising PISA results</h2>'

b <- '<span class="fu">read_parquet</span>(<span 
     class="st">"&lt;folder&gt;PISA_2015_student_subset.parquet"</span>)</span></code><button 
     title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre> 
     </div>
     </div>
     </section><section id="sec-leftjoin" class="level2" data-number="9.3"><h2 data-number="9.3" 
     class="anchored" data-anchor-id="sec-leftjoin">
     <span class="header-section-number">9.3</span> Linking data using <code>left_join</code>
     </h2>
     <p>some text</p>'

c <- paste(a,b,a)

I can extract the title from a using:

str_extract_all(a, '(?<=(<[/]span>)).*(?=(<[/]h))')[[1]] %>% str_squish()
> [1] "Standardising PISA results"

But trying this on b returns nothing:

str_extract_all(b, '(?<=(<[/]span>)).*(?=(<[/]h))')[[1]] %>% str_squish()
> character(0)

And on c it returns only the first and third <h2> titles, when it should return all three:

str_extract_all(c, '(?<=(<[/]span>)).*(?=(<[/]h))')[[1]] %>% str_squish()
> [1] "Standardising PISA results" "Standardising PISA results"

EDIT: from the comments, this appears to be caused by the regex not matching across the newline characters.

I've tried enabling single-line mode, (?s), in the regex, but it's still not working.


Solution

  • Here's a helper function that selects H2 elements containing a span but leaves the spans out of the result:

    library(xml2)
    library(stringr)
    
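    geth2 <- function(x) {
      # Parse the HTML and keep only <h2> elements that have a <span> child
      temp <- read_html(x) %>% xml_find_all("//h2[span]")
      # Drop the spans (the section numbers); xml_remove() edits the document in place
      xml_remove(xml_find_all(temp, ".//span"))
      # Return the remaining text with newlines and repeated spaces collapsed
      temp %>% xml_text() %>% str_squish()
    }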
    
    geth2(a)
    # [1] "Standardising PISA results"
    geth2(b)
    # [1] "Linking data using left_join"
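
    Note that xml_remove() changes the parsed document in place. That is harmless here, because the document is parsed inside the function and then discarded. If you would rather not modify the document, one possible variant is to select only the text nodes that are not inside a span. This is a minimal sketch: geth2_text and the sample string x are made-up names for illustration, not part of xml2.

    ```r
    library(xml2)
    library(stringr)

    # geth2_text is a hypothetical helper name, not an xml2 function.
    geth2_text <- function(x) {
      h2 <- xml_find_all(read_html(x), "//h2[span]")
      vapply(seq_along(h2), function(i) {
        # All text nodes under this h2, except those nested inside a span
        parts <- xml_find_all(h2[[i]], ".//text()[not(ancestor::span)]")
        # Join the pieces, then collapse newlines and repeated spaces
        str_squish(str_flatten(xml_text(parts)))
      }, character(1))
    }

    # Hypothetical sample with the title wrapped over several lines
    x <- '<h2 class="anchored">
      <span class="header-section-number">9.3</span> Linking data using <code>left_join</code>
      </h2>'

    geth2_text(x)
    # [1] "Linking data using left_join"
    ```

    Because each H2 is handled separately, this variant returns one string per H2 when the input contains several.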
    

    If you want to keep the markup inside the H2, this could work:

    geth2 <- function(x) {
      temp <- read_html(x) %>% xml_find_all("//h2[span]")
      xml_remove(xml_find_all(temp, ".//span"))
      temp %>% xml_contents() %>% as.character() %>% str_flatten(" ") %>% str_squish()  
    }
    geth2(a)
    # [1] "Standardising PISA results"
    geth2(b)
    # [1] "Linking data using <code>left_join</code>"
    

    For a version that will work with multiple H2 tags, you can use

    geth2 <- function(x) {
      temp <- read_html(x) %>% xml_find_all("//h2[span]")
      xml_remove(xml_find_all(temp, ".//span"))
      cleanup <- . %>% xml_contents() %>% as.character() %>% str_flatten(" ") %>% str_squish() 
      sapply(temp, cleanup)
    }
    geth2(c)
    # [1] "Standardising PISA results"
    # [2] "Linking data using <code>left_join</code>"
    # [3] "Standardising PISA results"
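
    As for why the regex in the question fails: by default "." does not match a newline, so ".*" cannot reach a </h2> that sits on a later line. Adding (?s) on its own does not fix it either, because the lookbehind can start at the first </span> anywhere in the input and the greedy ".*" then runs to the last </h. If you still want a regex, a rough sketch is to anchor the match on the <h2> tag, use lazy quantifiers, and take a capture group. The names x, h2_pattern and extract_h2_regex are made up for illustration, and the pattern assumes every <h2> contains a span.

    ```r
    library(stringr)

    # Hypothetical sample with the title wrapped over several lines
    x <- '<h2 class="anchored">
      <span class="header-section-number">9.3</span> Linking data using <code>left_join</code>
      </h2>'

    # Pattern from the question: "." stops at the newline before </h2>
    str_extract_all(x, '(?<=(<[/]span>)).*(?=(<[/]h))')[[1]]
    # character(0)

    # dotall = TRUE lets "." match newlines; the lazy ".*?" stops at the
    # first </span> after the <h2> and then at the first </h2>
    h2_pattern <- regex('<h2[^>]*>.*?</span>(.*?)</h2>', dotall = TRUE)

    # extract_h2_regex is a hypothetical helper name
    extract_h2_regex <- function(html) {
      # Column 2 of the match matrix holds the capture group
      str_squish(str_match_all(html, h2_pattern)[[1]][, 2])
    }

    extract_h2_regex(x)
    # [1] "Linking data using <code>left_join</code>"
    ```

    For HTML that is less regular than this, the parser-based functions above are the safer route.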