How to get the text between specific colons (:::) in markdown

I have multiple web pages whose markdown output I want to investigate. The problem I'm facing is that the markdown output can be a real mess, full of useless tags. I would like to extract all the text inside a ::: block with a certain name. Here is a reproducible example (I cut out a piece of the output since it is really big):

library(rvest)
library(rmarkdown)
link = "https://stackoverflow.com/users/14282714/quinten"

page = read_html(link)
xml2::write_html(page, file = "SO_page.html")

pandoc_convert("SO_page.html", to = "markdown")

::: site-footer--col
##### [Company](https://stackoverflow.co/){.js-gps-track gps-track="footer.click({ location: 4, link: 1 })"} {#company .-title}

-   [About](https://stackoverflow.co/){.js-gps-track .-link
    gps-track="footer.click({ location: 4, link: 1 })"}
-   [Press](https://stackoverflow.co/company/press/){.js-gps-track
    .-link gps-track="footer.click({ location: 4, link: 27 })"}
-   [Work
    Here](https://stackoverflow.co/company/work-here/){.js-gps-track
    .-link gps-track="footer.click({ location: 4, link: 9 })"}
-   [Legal](https://stackoverflow.com/legal){.js-gps-track .-link
    gps-track="footer.click({ location: 4, link: 7 })"}
-   [Privacy
    Policy](https://stackoverflow.com/legal/privacy-policy){.js-gps-track
    .-link gps-track="footer.click({ location: 4, link: 8 })"}
-   [Terms of
    Service](https://stackoverflow.com/legal/terms-of-service/public){.js-gps-track
    .-link gps-track="footer.click({ location: 4, link: 37 })"}
-   [Contact Us](/contact){.js-gps-track .-link
    gps-track="footer.click({ location: 4, link: 13 })"}
-   [Cookie Settings]{#consent-footer-link}
-   [Cookie
    Policy](https://stackoverflow.com/legal/cookie-policy){.js-gps-track
    .-link gps-track="footer.click({ location: 4, link: 39 })"}
:::

Created on 2024-04-29 with reprex v2.1.0

Now I would like to get all the text of the site-footer--col block. The problem is that there are a lot of these ::: blocks, each with its own name. The end of a block is also not obvious in the raw text, although an IDE shows it in a different color. So I was wondering if anyone knows how to extract the text of a specific callout block? Note that I don't want to use the HTML output, only the markdown output, because of its format.

Solution

Do I understand correctly: you need to extract the text between ::: site-footer--col and the next :::?

I've modified the pandoc_convert() call to write the result to SO_page.md so that I can read it back in as text. I then use stringr::str_extract_all() to pull out the required text.

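The arguments dotall = TRUE and multiline = TRUE let the regex match across several lines: dotall makes . match newlines as well, and multiline makes ^ match at the start of every line rather than only at the start of the document.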
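
To see what the two flags do in isolation, here is a minimal sketch on a toy string. The block names note and other and the variable names are made up for illustration; only stringr::regex() and stringr::str_extract_all() come from the answer itself.

```r
library(stringr)

# Toy markdown with two fenced blocks (hypothetical names, not from the real page)
md <- "intro\n::: note\nfirst line\nsecond line\n:::\n\n::: other\nx\n:::\n"

# multiline only: `.` cannot cross a newline, so a block spanning lines never matches
no_dotall <- str_extract_all(md, regex("^::: note.+?^:::", multiline = TRUE))[[1]]

# dotall + multiline: `.` crosses newlines, `^` anchors at the start of each line,
# and the lazy `.+?` stops at the first line that begins with ":::"
with_flags <- str_extract_all(
  md,
  regex("^::: note.+?^:::", dotall = TRUE, multiline = TRUE)
)[[1]]

length(no_dotall)      # no match without dotall
cat(with_flags, "\n")  # the whole "note" block, from its opening to its closing :::
```

The lazy quantifier .+? matters as much as the flags: a greedy .+ would run on to the last ::: in the document instead of the first one after the opening line.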

library(rvest)
library(rmarkdown)
link = "https://stackoverflow.com/users/14282714/quinten"

page = read_html(link)
xml2::write_html(page, file = "SO_page.html")

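# Write the markdown to a file instead of printing it, so it can be read back in
pandoc_convert("SO_page.html", to = "markdown", output = "SO_page.md")

# Read the whole file as a single string
markdown <- readr::read_file("SO_page.md")

# Lazily match from a "::: site-footer--col" line up to the next line starting with ":::"
pattern <- stringr::regex("\\n::: site-footer--col.+?^:::", dotall = TRUE, multiline = TRUE)

# There is only one input string, so take the first element of the result list
footers <- stringr::str_extract_all(markdown, pattern)[[1]]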

cat(footers, sep = "\n\n")
#> 
#> ::: site-footer--col
#> ##### [Stack Overflow](https://stackoverflow.com){.js-gps-track gps-track="footer.click({ location: 4, link: 15})"} {#stack-overflow .-title}
#> 
#> -   [Questions](/questions){.js-gps-track .-link
#>     gps-track="footer.click({ location: 4, link: 16})"}
#> -   [Help](/help){.js-gps-track .-link
#>     gps-track="footer.click({ location: 4, link: 3 })"}
#> :::
#> 
#> 
#> ::: site-footer--col
#> ##### [Products](https://stackoverflow.co/){.js-gps-track gps-track="footer.click({ location: 4, link: 19 })"} {#products .-title}
#> 
#> -   [Teams](https://stackoverflow.co/teams/){.js-gps-track .-link
#>     ga="[\"teams traffic\",\"footer - site nav\",\"stackoverflow.com/teams\",null,{\"dimension4\":\"teams\"}]"
#>     gps-track="footer.click({ location: 4, link: 29 })"}
#> -   [Advertising](https://stackoverflow.co/advertising/){.js-gps-track
#>     .-link gps-track="footer.click({ location: 4, link: 21 })"}
#> -   [Collectives](https://stackoverflow.co/collectives/){.js-gps-track
#>     .-link gps-track="footer.click({ location: 4, link: 40 })"}
#> -   [Talent](https://stackoverflow.co/talent/){.js-gps-track .-link
#>     gps-track="footer.click({ location: 4, link: 20 })"}
#> :::
#> 
#> 
#> ::: site-footer--col
#> ##### [Company](https://stackoverflow.co/){.js-gps-track gps-track="footer.click({ location: 4, link: 1 })"} {#company .-title}
#> 
#> -   [About](https://stackoverflow.co/){.js-gps-track .-link
#>     gps-track="footer.click({ location: 4, link: 1 })"}
#> -   [Press](https://stackoverflow.co/company/press/){.js-gps-track
#>     .-link gps-track="footer.click({ location: 4, link: 27 })"}
#> -   [Work
#>     Here](https://stackoverflow.co/company/work-here/){.js-gps-track
#>     .-link gps-track="footer.click({ location: 4, link: 9 })"}
#> -   [Legal](https://stackoverflow.com/legal){.js-gps-track .-link
#>     gps-track="footer.click({ location: 4, link: 7 })"}
#> -   [Privacy
#>     Policy](https://stackoverflow.com/legal/privacy-policy){.js-gps-track
#>     .-link gps-track="footer.click({ location: 4, link: 8 })"}
#> -   [Terms of
#>     Service](https://stackoverflow.com/legal/terms-of-service/public){.js-gps-track
#>     .-link gps-track="footer.click({ location: 4, link: 37 })"}
#> -   [Contact Us](/contact){.js-gps-track .-link
#>     gps-track="footer.click({ location: 4, link: 13 })"}
#> -   [Cookie Settings]{#consent-footer-link}
#> -   [Cookie
#>     Policy](https://stackoverflow.com/legal/cookie-policy){.js-gps-track
#>     .-link gps-track="footer.click({ location: 4, link: 39 })"}
#> :::

Created on 2024-04-29 with reprex v2.1.0
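
The regex stops at the first line that starts with :::, which is fine as long as the target block contains no nested blocks. If nesting is a concern, or if you want to pass the block name as an argument, a line-based approach may be easier to reason about. The following is only a rough sketch in base R, not part of the original answer. The function name extract_fenced_divs is hypothetical. It assumes that an opening fence is a line of three or more colons followed by a bare name, that a closing fence is a line of colons only, and that no ::: lines appear inside code blocks.

```r
# Hypothetical helper: return every fenced block called `name` as one string each.
extract_fenced_divs <- function(lines, name) {
  # Closing fence: three or more colons and nothing else
  is_close <- grepl("^:{3,}[[:space:]]*$", lines)
  # Opening fence: three or more colons followed by some attribute text
  is_open <- grepl("^:{3,}[[:space:]]*[^:[:space:]]", lines)
  # Attribute text with the leading (and optional trailing) colons stripped
  attrs <- trimws(gsub("^:+|:+[[:space:]]*$", "", lines))

  blocks <- character(0)
  depth <- 0L               # how many blocks are currently open
  start <- NA_integer_      # line where the block being captured opened
  start_depth <- NA_integer_

  for (i in seq_along(lines)) {
    if (is_open[i]) {
      depth <- depth + 1L
      if (is.na(start) && attrs[i] == name) {
        start <- i
        start_depth <- depth
      }
    } else if (is_close[i] && depth > 0L) {
      # Only the fence that returns to the opening depth ends the captured block
      if (!is.na(start) && depth == start_depth) {
        blocks <- c(blocks, paste(lines[start:i], collapse = "\n"))
        start <- NA_integer_
      }
      depth <- depth - 1L
    }
  }
  blocks
}

# Toy input (made up): two target blocks, one of them with a nested block
md <- c(
  ":::: wrapper",
  "::: site-footer--col",
  "-   [About](https://stackoverflow.co/)",
  ":::",
  "",
  "::: other",
  "skip me",
  ":::",
  "",
  "::: site-footer--col",
  "::: inner",
  "nested text",
  ":::",
  "-   [Teams](https://stackoverflow.co/teams/)",
  ":::",
  "::::"
)

footers <- extract_fenced_divs(md, "site-footer--col")
cat(footers, sep = "\n\n")
```

On the real file this would presumably be called as extract_fenced_divs(readLines("SO_page.md"), "site-footer--col"). Blocks whose opening line uses the braced attribute form, such as ::: {.some-class #some-id}, are not matched by this sketch and would need extra handling.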