r, quantmod, rvest

Using rvest to scrape a website - Selecting html node?


I have a question about my latest rvest scrape.

I want to scrape this page (and some other stocks as well): http://www.finviz.com/quote.ashx?t=AA&ty=c&p=d&b=1

I need a list of Market Cap values. Market Cap is the first box in the second line of the table on that page. The list should cover approximately 50-100 stocks.

I am using rvest for that.

library(rvest)

html = read_html("http://www.finviz.com/quote.ashx?t=A")

cast = html_nodes(html, "table-dark-row")

The problem is that I cannot work out what to pass to html_nodes. Any idea how to find the correct node for html_nodes?

I am using Firebug/FireFinder to inspect the webpage.


Solution

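  • Not sure if this is what you want, because I cannot find a list with approx. 50-100 stocks.

    But for what it's worth, using SelectorGadget I came up with the selector .table-dark-row:nth-child(2) .snapshot-td2:nth-child(2) to select the Market Cap (the first box in the second line of this page: http://www.finviz.com/quote.ashx?t=AA&ty=c&p=d&b=1). Note the leading dots: in a CSS selector, .table-dark-row matches elements with that class, whereas table-dark-row without the dot looks for a <table-dark-row> tag, which is why your html_nodes call finds nothing.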

    > library(rvest)
    > 
    > html = read_html("http://www.finviz.com/quote.ashx?t=AA&ty=c&p=d&b=1")
    > 
    > cast = html_nodes(html, ".table-dark-row:nth-child(2) .snapshot-td2:nth-child(2)")
    > cast
    {xml_nodeset (1)}
    [1] <td width="8%" class="snapshot-td2" align="left">\n  <b>11.58B</b>\n</td>
    > 
    

    If this is not exactly what you want, use SelectorGadget to locate the element you need.

    Hope this helps.

    EDIT:

    Here is the complete solution:

    library(rvest)
    
    html = read_html("http://www.finviz.com/quote.ashx?t=AA&ty=c&p=d&b=1")
    
    cast = html_nodes(html, ".table-dark-row:nth-child(2) .snapshot-td2:nth-child(2)")
    
    # Strip the "B" (billions) suffix and convert to a number; the result (e.g. 11.58) is in billions
    html_text(cast) %>%
        gsub(pattern = "B", replacement = "") %>%
        as.numeric()
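
For the 50-100 stocks part of the question, here is a rough sketch of how the same selector could be wrapped in a function and applied per ticker. Everything named below (parse_abbreviated, extract_market_cap, fetch_market_caps, the suffix table, the sample HTML and the pause argument) is my own assumption for illustration, not something from finviz or rvest, and the site's markup may have changed since this was written. The sample page is a small inline stand-in so the selector logic can be checked without a network call.

```r
library(rvest)

# Convert strings such as "11.58B" or "950.2M" to plain numbers.
# The suffix table is an assumption about how the site abbreviates values.
parse_abbreviated <- function(x) {
  multipliers <- c(K = 1e3, M = 1e6, B = 1e9, T = 1e12)
  x <- trimws(x)
  suffix <- substr(x, nchar(x), nchar(x))
  has_suffix <- suffix %in% names(multipliers)
  digits <- ifelse(has_suffix, substr(x, 1, nchar(x) - 1), x)
  number <- suppressWarnings(as.numeric(digits))  # unparseable text becomes NA
  scale <- ifelse(has_suffix, multipliers[suffix], 1)
  unname(number * scale)
}

# Pull the Market Cap cell out of an already parsed page.
# Returns NA if the selector matches nothing.
extract_market_cap <- function(page,
    selector = ".table-dark-row:nth-child(2) .snapshot-td2:nth-child(2)") {
  cell <- html_nodes(page, selector)
  if (length(cell) == 0) return(NA_real_)
  parse_abbreviated(html_text(cell)[1])
}

# Made-up stand-in for the quote page, reduced to the two rows that matter.
sample_page <- read_html('
  <table>
    <tr class="table-dark-row">
      <td class="snapshot-td2-cp">Index</td><td class="snapshot-td2">S&amp;P 500</td>
    </tr>
    <tr class="table-dark-row">
      <td class="snapshot-td2-cp">Market Cap</td><td class="snapshot-td2"><b>11.58B</b></td>
    </tr>
  </table>')

length(html_nodes(sample_page, "table-dark-row"))   # 0: no dot, so it is read as a tag name
length(html_nodes(sample_page, ".table-dark-row"))  # 2: with the dot it is a class selector
extract_market_cap(sample_page)                     # 11.58e9, i.e. 11.58 billion

# Hypothetical helper: fetch one quote page per ticker. Not run here, since it
# needs network access. Failed requests yield NA instead of stopping the loop.
fetch_market_caps <- function(tickers, pause = 1) {
  vapply(tickers, function(ticker) {
    Sys.sleep(pause)  # be polite to the server between requests
    url <- paste0("http://www.finviz.com/quote.ashx?t=", ticker)
    tryCatch(extract_market_cap(read_html(url)), error = function(e) NA_real_)
  }, numeric(1))
}
```

Calling something like fetch_market_caps(c("AA", "A")) should then give a named numeric vector with one entry per ticker. Unlike the gsub approach above, parse_abbreviated keeps the scale, so a stock quoted in millions and one quoted in billions end up in the same unit.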