rmarkdown pandoc

How to get text between specific colon in markdown


I have multiple web pages that I convert to markdown so I can investigate their content. The problem I'm facing is that the markdown output can be a real mess, full of useless attributes. I would like to extract all the text inside a specific ::: block (a pandoc fenced div) with a certain name. Here is a reproducible example (I show only a piece of the output, since the full output is really big):

library(rvest)
library(rmarkdown)
link = "https://stackoverflow.com/users/14282714/quinten"

page = read_html(link)
xml2::write_html(page, file = "SO_page.html")

pandoc_convert("SO_page.html", to = "markdown")

::: site-footer--col
##### [Company](https://stackoverflow.co/){.js-gps-track gps-track="footer.click({ location: 4, link: 1 })"} {#company .-title}

-   [About](https://stackoverflow.co/){.js-gps-track .-link
    gps-track="footer.click({ location: 4, link: 1 })"}
-   [Press](https://stackoverflow.co/company/press/){.js-gps-track
    .-link gps-track="footer.click({ location: 4, link: 27 })"}
-   [Work
    Here](https://stackoverflow.co/company/work-here/){.js-gps-track
    .-link gps-track="footer.click({ location: 4, link: 9 })"}
-   [Legal](https://stackoverflow.com/legal){.js-gps-track .-link
    gps-track="footer.click({ location: 4, link: 7 })"}
-   [Privacy
    Policy](https://stackoverflow.com/legal/privacy-policy){.js-gps-track
    .-link gps-track="footer.click({ location: 4, link: 8 })"}
-   [Terms of
    Service](https://stackoverflow.com/legal/terms-of-service/public){.js-gps-track
    .-link gps-track="footer.click({ location: 4, link: 37 })"}
-   [Contact Us](/contact){.js-gps-track .-link
    gps-track="footer.click({ location: 4, link: 13 })"}
-   [Cookie Settings]{#consent-footer-link}
-   [Cookie
    Policy](https://stackoverflow.com/legal/cookie-policy){.js-gps-track
    .-link gps-track="footer.click({ location: 4, link: 39 })"}
:::

Created on 2024-04-29 with reprex v2.1.0

Now I would like to get all the text of the site-footer--col block. The problem is that there are a lot of these blocks, each with its own name. Also, the end of a block is not obvious: it is just a bare ::: line, which an IDE only sets apart by color. So I was wondering if anyone knows how to extract the text of a specific block? Note that I don't want to use the HTML output, only the markdown output, because of its format.


Solution

  • Do I understand correctly: you need to extract the text between ::: site-footer--col and the next :::?

    I've modified the pandoc_convert() call to write the result to SO_page.md so that I can read it back in as text, and then used stringr::str_extract_all() to pull out the required text.

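    The arguments dotall = TRUE and multiline = TRUE make the multi-line match possible: dotall lets . match newlines, so .+? can span several lines, and multiline lets ^ match at the start of every line, so ^::: finds the closing fence.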
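To see what each flag contributes, here is a minimal sketch on a made-up string (the `md` text and the `note` block name are assumptions for illustration, not taken from the Stack Overflow page). It only uses `stringr::regex()` and `stringr::str_extract()`.

```r
library(stringr)

# Toy input standing in for pandoc's markdown output (invented for this example).
md <- "intro\n::: note\nline one\nline two\n:::\noutro"

# multiline only: `.` cannot cross a newline, so `.+?` never reaches the closing fence.
no_dotall <- str_extract(md, regex("^::: note.+?^:::", multiline = TRUE))

# dotall only: `^` matches only at the very start of the string, which is "intro".
no_multiline <- str_extract(md, regex("^::: note.+?^:::", dotall = TRUE))

# Both flags: `^` matches at each line start and `.` spans lines.
both <- str_extract(md, regex("^::: note.+?^:::", dotall = TRUE, multiline = TRUE))

print(no_dotall)
print(no_multiline)
cat(both, "\n")
```

With only one of the two flags the pattern finds nothing, which is why the answer sets both.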

    library(rvest)
    library(rmarkdown)
    link = "https://stackoverflow.com/users/14282714/quinten"
    
    page = read_html(link)
    xml2::write_html(page, file = "SO_page.html")
    
    pandoc_convert("SO_page.html", to = "markdown", output = "SO_page.md")
    
    markdown <- readr::read_file("SO_page.md")
    
    pattern <- stringr::regex("\\n::: site-footer--col.+?^:::", dotall = TRUE, multiline = TRUE)
    
    footers <- stringr::str_extract_all(markdown, pattern)[[1]]
    
    cat(footers, sep = "\n\n")
    #> 
    #> ::: site-footer--col
    #> ##### [Stack Overflow](https://stackoverflow.com){.js-gps-track gps-track="footer.click({ location: 4, link: 15})"} {#stack-overflow .-title}
    #> 
    #> -   [Questions](/questions){.js-gps-track .-link
    #>     gps-track="footer.click({ location: 4, link: 16})"}
    #> -   [Help](/help){.js-gps-track .-link
    #>     gps-track="footer.click({ location: 4, link: 3 })"}
    #> :::
    #> 
    #> 
    #> ::: site-footer--col
    #> ##### [Products](https://stackoverflow.co/){.js-gps-track gps-track="footer.click({ location: 4, link: 19 })"} {#products .-title}
    #> 
    #> -   [Teams](https://stackoverflow.co/teams/){.js-gps-track .-link
    #>     ga="[\"teams traffic\",\"footer - site nav\",\"stackoverflow.com/teams\",null,{\"dimension4\":\"teams\"}]"
    #>     gps-track="footer.click({ location: 4, link: 29 })"}
    #> -   [Advertising](https://stackoverflow.co/advertising/){.js-gps-track
    #>     .-link gps-track="footer.click({ location: 4, link: 21 })"}
    #> -   [Collectives](https://stackoverflow.co/collectives/){.js-gps-track
    #>     .-link gps-track="footer.click({ location: 4, link: 40 })"}
    #> -   [Talent](https://stackoverflow.co/talent/){.js-gps-track .-link
    #>     gps-track="footer.click({ location: 4, link: 20 })"}
    #> :::
    #> 
    #> 
    #> ::: site-footer--col
    #> ##### [Company](https://stackoverflow.co/){.js-gps-track gps-track="footer.click({ location: 4, link: 1 })"} {#company .-title}
    #> 
    #> -   [About](https://stackoverflow.co/){.js-gps-track .-link
    #>     gps-track="footer.click({ location: 4, link: 1 })"}
    #> -   [Press](https://stackoverflow.co/company/press/){.js-gps-track
    #>     .-link gps-track="footer.click({ location: 4, link: 27 })"}
    #> -   [Work
    #>     Here](https://stackoverflow.co/company/work-here/){.js-gps-track
    #>     .-link gps-track="footer.click({ location: 4, link: 9 })"}
    #> -   [Legal](https://stackoverflow.com/legal){.js-gps-track .-link
    #>     gps-track="footer.click({ location: 4, link: 7 })"}
    #> -   [Privacy
    #>     Policy](https://stackoverflow.com/legal/privacy-policy){.js-gps-track
    #>     .-link gps-track="footer.click({ location: 4, link: 8 })"}
    #> -   [Terms of
    #>     Service](https://stackoverflow.com/legal/terms-of-service/public){.js-gps-track
    #>     .-link gps-track="footer.click({ location: 4, link: 37 })"}
    #> -   [Contact Us](/contact){.js-gps-track .-link
    #>     gps-track="footer.click({ location: 4, link: 13 })"}
    #> -   [Cookie Settings]{#consent-footer-link}
    #> -   [Cookie
    #>     Policy](https://stackoverflow.com/legal/cookie-policy){.js-gps-track
    #>     .-link gps-track="footer.click({ location: 4, link: 39 })"}
    #> :::
    

    Created on 2024-04-29 with reprex v2.1.0
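Since the page contains many blocks with different names, the same idea could be wrapped in a small function that takes the name as an argument. The sketch below is only an illustration: `extract_divs()` is a hypothetical helper (not part of stringr or rmarkdown), the sample text is invented, and it assumes the simple `::: name` opening form shown in the question, not the `::: {.a .b}` attribute form. The name is quoted with `\Q...\E` so that any regex metacharacters in it are taken literally, and the closing fence must be a bare `:::` line.

```r
library(stringr)

# Hypothetical helper: return every block that opens with `::: <class>` on its
# own line and ends at the next line consisting only of `:::`.
extract_divs <- function(markdown, class) {
  pattern <- regex(
    paste0("^::: \\Q", class, "\\E$.*?^:::$"),
    dotall = TRUE, multiline = TRUE
  )
  str_extract_all(markdown, pattern)[[1]]
}

# Invented sample input with two matching blocks and one block to skip.
md <- paste(
  "::: site-footer--col", "first", ":::", "",
  "::: other", "skip me", ":::", "",
  "::: site-footer--col", "second", ":::",
  sep = "\n"
)

blocks <- extract_divs(md, "site-footer--col")

# Drop the fence lines to keep only the text inside each block.
bodies <- str_trim(str_remove_all(blocks, regex("^:::.*\\n?", multiline = TRUE)))

cat(blocks, sep = "\n\n")
print(bodies)
```

One limitation to keep in mind: if a block contains a nested `:::` block, the match stops at the inner closing fence, so nested divs would need a real parser (for example a pandoc filter) rather than a regex.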