I have multiple web pages that I want to investigate in markdown format. The problem I'm facing is that the markdown output can be a real mess, full of useless tags. I would like to extract all the text inside a ::: block with a certain name. Here is a reproducible example (I cut out a piece of the output since it is really big):
library(rvest)
library(rmarkdown)
link = "https://stackoverflow.com/users/14282714/quinten"
page = read_html(link)
xml2::write_html(page, file = "SO_page.html")
pandoc_convert("SO_page.html", to = "markdown")
::: site-footer--col
##### [Company](https://stackoverflow.co/){.js-gps-track gps-track="footer.click({ location: 4, link: 1 })"} {#company .-title}
- [About](https://stackoverflow.co/){.js-gps-track .-link
gps-track="footer.click({ location: 4, link: 1 })"}
- [Press](https://stackoverflow.co/company/press/){.js-gps-track
.-link gps-track="footer.click({ location: 4, link: 27 })"}
- [Work
Here](https://stackoverflow.co/company/work-here/){.js-gps-track
.-link gps-track="footer.click({ location: 4, link: 9 })"}
- [Legal](https://stackoverflow.com/legal){.js-gps-track .-link
gps-track="footer.click({ location: 4, link: 7 })"}
- [Privacy
Policy](https://stackoverflow.com/legal/privacy-policy){.js-gps-track
.-link gps-track="footer.click({ location: 4, link: 8 })"}
- [Terms of
Service](https://stackoverflow.com/legal/terms-of-service/public){.js-gps-track
.-link gps-track="footer.click({ location: 4, link: 37 })"}
- [Contact Us](/contact){.js-gps-track .-link
gps-track="footer.click({ location: 4, link: 13 })"}
- [Cookie Settings]{#consent-footer-link}
- [Cookie
Policy](https://stackoverflow.com/legal/cookie-policy){.js-gps-track
.-link gps-track="footer.click({ location: 4, link: 39 })"}
:::
Created on 2024-04-29 with reprex v2.1.0
Now I would like to have all the text of the site-footer--col block. The problem is that there are a lot of these ::: blocks with specific names. Also, it is not clear where a block ends; in an IDE the closing ::: only shows up in a different color. So I was wondering if anyone knows how to extract the text of a specific block? Note that I don't want to use the HTML output, only the markdown output, because of its format.
Do I understand correctly: you need to extract the text between ::: site-footer--col and the next :::?
I've modified the pandoc_convert() call to write its result to SO_page.md so that I can read it back in as text, then used stringr::str_extract_all() to pull out the required text.
The arguments dotall = TRUE and multiline = TRUE make the regex work across lines: dotall lets . match newline characters, and multiline lets ^ match at the start of every line rather than only at the start of the string.
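To see what the two flags do in isolation, here is a minimal sketch on a toy string. The markdown in md and the block name note are made up for illustration and are not taken from the converted page.

```r
library(stringr)

# Toy markdown containing one ::: block (hypothetical content)
md <- "intro\n::: note\nline one\nline two\n:::\noutro\n"

# Without dotall, "." cannot cross a newline, so the multi-line block is not found
without_dotall <- str_extract(md, regex("^::: note.+?^:::", multiline = TRUE))

# With dotall, "." also matches newlines; multiline lets "^" anchor at each line start
with_dotall <- str_extract(md, regex("^::: note.+?^:::", dotall = TRUE, multiline = TRUE))

print(without_dotall)
cat(with_dotall, "\n")
```

The lazy quantifier .+? matters as well: it stops at the first line that starts with :::, instead of running on to the last one in the document.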
library(rvest)
library(rmarkdown)
link = "https://stackoverflow.com/users/14282714/quinten"
page = read_html(link)
xml2::write_html(page, file = "SO_page.html")
pandoc_convert("SO_page.html", to = "markdown", output = "SO_page.md")
markdown <- readr::read_file("SO_page.md")
pattern <- stringr::regex("\\n::: site-footer--col.+?^:::", dotall = TRUE, multiline = TRUE)
footers <- stringr::str_extract_all(markdown, pattern)[[1]]
cat(footers, sep = "\n\n")
#>
#> ::: site-footer--col
#> ##### [Stack Overflow](https://stackoverflow.com){.js-gps-track gps-track="footer.click({ location: 4, link: 15})"} {#stack-overflow .-title}
#>
#> - [Questions](/questions){.js-gps-track .-link
#> gps-track="footer.click({ location: 4, link: 16})"}
#> - [Help](/help){.js-gps-track .-link
#> gps-track="footer.click({ location: 4, link: 3 })"}
#> :::
#>
#>
#> ::: site-footer--col
#> ##### [Products](https://stackoverflow.co/){.js-gps-track gps-track="footer.click({ location: 4, link: 19 })"} {#products .-title}
#>
#> - [Teams](https://stackoverflow.co/teams/){.js-gps-track .-link
#> ga="[\"teams traffic\",\"footer - site nav\",\"stackoverflow.com/teams\",null,{\"dimension4\":\"teams\"}]"
#> gps-track="footer.click({ location: 4, link: 29 })"}
#> - [Advertising](https://stackoverflow.co/advertising/){.js-gps-track
#> .-link gps-track="footer.click({ location: 4, link: 21 })"}
#> - [Collectives](https://stackoverflow.co/collectives/){.js-gps-track
#> .-link gps-track="footer.click({ location: 4, link: 40 })"}
#> - [Talent](https://stackoverflow.co/talent/){.js-gps-track .-link
#> gps-track="footer.click({ location: 4, link: 20 })"}
#> :::
#>
#>
#> ::: site-footer--col
#> ##### [Company](https://stackoverflow.co/){.js-gps-track gps-track="footer.click({ location: 4, link: 1 })"} {#company .-title}
#>
#> - [About](https://stackoverflow.co/){.js-gps-track .-link
#> gps-track="footer.click({ location: 4, link: 1 })"}
#> - [Press](https://stackoverflow.co/company/press/){.js-gps-track
#> .-link gps-track="footer.click({ location: 4, link: 27 })"}
#> - [Work
#> Here](https://stackoverflow.co/company/work-here/){.js-gps-track
#> .-link gps-track="footer.click({ location: 4, link: 9 })"}
#> - [Legal](https://stackoverflow.com/legal){.js-gps-track .-link
#> gps-track="footer.click({ location: 4, link: 7 })"}
#> - [Privacy
#> Policy](https://stackoverflow.com/legal/privacy-policy){.js-gps-track
#> .-link gps-track="footer.click({ location: 4, link: 8 })"}
#> - [Terms of
#> Service](https://stackoverflow.com/legal/terms-of-service/public){.js-gps-track
#> .-link gps-track="footer.click({ location: 4, link: 37 })"}
#> - [Contact Us](/contact){.js-gps-track .-link
#> gps-track="footer.click({ location: 4, link: 13 })"}
#> - [Cookie Settings]{#consent-footer-link}
#> - [Cookie
#> Policy](https://stackoverflow.com/legal/cookie-policy){.js-gps-track
#> .-link gps-track="footer.click({ location: 4, link: 39 })"}
#> :::
Created on 2024-04-29 with reprex v2.1.0
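Since the page contains many differently named blocks, the same idea could be wrapped in a small helper that takes the block name as an argument. This is only a sketch: extract_div() is a hypothetical name, it assumes the block is written as ::: name on a line of its own and contains no nested ::: blocks, and it relies on stringr::str_escape(), which as far as I know needs stringr 1.5.0 or later.

```r
library(stringr)

# Hypothetical helper: return every ::: block whose name is exactly `name`.
# Assumes no nested ::: blocks inside the block being extracted.
extract_div <- function(markdown, name) {
  pattern <- regex(
    # str_escape() protects regex metacharacters in the name;
    # "$" after the name avoids matching longer names with the same prefix
    paste0("^::: ", str_escape(name), "$.+?^:::$"),
    dotall = TRUE, multiline = TRUE
  )
  str_extract_all(markdown, pattern)[[1]]
}

# Toy input with made-up block names and content
toy <- "::: a\nx\n:::\n\n::: b\ny\n:::\n\n::: a\nz\n:::\n"
blocks <- extract_div(toy, "a")
cat(blocks, sep = "\n\n")
```

With the file from above, the call would be extract_div(markdown, "site-footer--col").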