
How can I get href attr from this website?


I'm trying to parse the HTML of this website, but when I select the nodes that should contain the links and extract their href attribute, I get "" for every node. What am I doing wrong?

library(rvest)

# URL holds the address of the page being scraped
texto_01 <- read_html(URL)
titulos_noticias <- texto_01 %>% html_nodes("p") %>% html_nodes("div") %>% html_nodes("ol") %>% html_nodes("li")  %>% html_nodes("a")
titulos_noticias_texto <- html_attr(titulos_noticias,"href")
titulos_noticias_texto

Appreciate the help. Thanks a lot, Felipe


Solution

  • The content is loaded dynamically: the page runs a search and then renders the result set with JavaScript, so the links are not in the HTML that read_html receives. You need to mimic the actual search request, which you can find in the browser's network tab. The results come back as JSON. The data of interest is in r$Rows, and you construct each URL by concatenating its parts:

    paste0("https://www.bcb.gov.br/estabilidadefinanceira/exibenormativo?tipo=", item$TipodoNormativoOWSCHCS,'&numero=',as.integer(item$NumeroOWSNMBR))
    

    You can use paste0 and map_df to handle this URL reconstruction in a loop over the JSON object returned in r$Rows.

    You can see the JavaScript that handles this process at line 6816 of the JS file https://www.bcb.gov.br/BcbModule.cdb75dd11ebbc7b56192.js, found in the sources tab.

    Note that the JS uses a variable that has already been set, found at line 5609.


    R:

    library(jsonlite)
    library(purrr)
    
    # The query string contains spaces and a non-ASCII character, so percent-encode it before the request
    url <- utils::URLencode(enc2utf8('https://www.bcb.gov.br/api/search/app/normativos/buscanormativos?querytext=ContentType:normativo AND contentSource:normativos AND cessão&rowlimit=15&startrow=0&sortlist=Data1OWSDATE:descending&refinementfilters=Data:range(datetime(2018-09-17),datetime(2019-09-20T23:59:59))'))
    r <- jsonlite::read_json(url)
    
    df <- map_df(r$Rows, function(item) {
      data.frame(title = item$title,
                 url = paste0("https://www.bcb.gov.br/estabilidadefinanceira/exibenormativo?tipo=", item$TipodoNormativoOWSCHCS,'&numero=',as.integer(item$NumeroOWSNMBR)),
                 stringsAsFactors=FALSE)
    })
    
    head(df)
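
To check the URL reconstruction without hitting the network, here is a minimal offline sketch. It assumes r$Rows is a list of records carrying the title, TipodoNormativoOWSCHCS and NumeroOWSNMBR fields used above. The sample values and the helper name build_url are made up for illustration and are not real API output. It uses base R (lapply plus rbind) in place of map_df so that it runs with no extra packages.

```r
# Hypothetical sample mimicking the assumed shape of r$Rows.
# The values below are invented for illustration only.
rows <- list(
  list(title = "Example A", TipodoNormativoOWSCHCS = "Circular",   NumeroOWSNMBR = 1234),
  list(title = "Example B", TipodoNormativoOWSCHCS = "Comunicado", NumeroOWSNMBR = 56)
)

# Base address taken from the answer above.
base_url <- "https://www.bcb.gov.br/estabilidadefinanceira/exibenormativo"

# Hypothetical helper: concatenates the parts into the link for one record.
build_url <- function(item) {
  paste0(base_url,
         "?tipo=", item$TipodoNormativoOWSCHCS,
         "&numero=", as.integer(item$NumeroOWSNMBR))
}

# Build one data frame row per record, then stack the rows.
df <- do.call(rbind, lapply(rows, function(item) {
  data.frame(title = item$title,
             url = build_url(item),
             stringsAsFactors = FALSE)
}))

print(df)
```

If the real records match this shape, swapping rows for r$Rows should give the same two-column data frame as the map_df version.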