
How to scrape tables inside a comment tag in HTML with R?


I am trying to scrape http://www.basketball-reference.com/teams/CHI/2015.html using rvest. I used SelectorGadget and found that the selector for the table I want is #advanced. However, rvest wasn't picking it up. Looking at the page source, I noticed that the tables are inside an HTML comment (<!-- ... -->).

What is the best way to get the tables from inside the comment tags? Thanks!

Edit: I am trying to pull the 'Advanced' table: http://www.basketball-reference.com/teams/CHI/2015.html#advanced::none


Solution

  • OK, got it.

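    library(stringi)  # stri_extract_all_regex()
    library(knitr)    # kable()
    library(rvest)    # html_table()

    # Thin wrapper around XML::htmlParse() (requires the XML package)
    any_version_html <- function(x) {
      XML::htmlParse(x)
    }

    a <- 'http://www.basketball-reference.com/teams/CHI/2015.html#advanced::none'
    b <- readLines(a)               # raw page source, one element per line
    c <- paste0(b, collapse = "")   # collapse to one string so the regex can span lines
    # Grab every <table>...</table> span, including those inside <!-- --> comments
    d <- as.character(unlist(stri_extract_all_regex(c, '<table(.*?)/table>', omit_no_match = TRUE, simplify = TRUE)))

    # One data frame per extracted table
    e <- html_table(any_version_html(d))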
    
    
    > kable(summary(e),'rst')
    ======  ==========  ====
    Length  Class       Mode
    ======  ==========  ====
    9       data.frame  list
    2       data.frame  list
    24      data.frame  list
    21      data.frame  list
    28      data.frame  list
    28      data.frame  list
    27      data.frame  list
    30      data.frame  list
    27      data.frame  list
    27      data.frame  list
    28      data.frame  list
    28      data.frame  list
    27      data.frame  list
    30      data.frame  list
    27      data.frame  list
    27      data.frame  list
    3       data.frame  list
    ======  ==========  ====
    
    
    kable(e[[1]],'rst')
    
    
    ===  ================  ===  ====  ===  ==================  ===  ===  =================================
    No.  Player            Pos  Ht     Wt  Birth Date               Exp  College                          
    ===  ================  ===  ====  ===  ==================  ===  ===  =================================
     41  Cameron Bairstow  PF   6-9   250  December 7, 1990    au   R    University of New Mexico         
      0  Aaron Brooks      PG   6-0   161  January 14, 1985    us   6    University of Oregon             
     21  Jimmy Butler      SG   6-7   220  September 14, 1989  us   3    Marquette University             
     34  Mike Dunleavy     SF   6-9   230  September 15, 1980  us   12   Duke University                  
     16  Pau Gasol         PF   7-0   250  July 6, 1980        es   13                                    
     22  Taj Gibson        PF   6-9   225  June 24, 1985       us   5    University of Southern California
     12  Kirk Hinrich      SG   6-4   190  January 2, 1981     us   11   University of Kansas             
      3  Doug McDermott    SF   6-8   225  January 3, 1992     us   R    Creighton University    
    
    
    ## Better to index the list by table name. This is somewhat cheating, since
    ## it relies on knowing the first and last table titles; I prefer to parse
    ## in the dark.

    # Table titles are in <h2> tags
    e_names <- as.character(unlist(stri_extract_all_regex(c, '<h2(.*?)/h2>', simplify = TRUE)))
    # Keep the titles from 'Roster' through 'Salaries' and strip the markup
    e_names <- gsub("<(.*?)>", "", e_names[grep('Roster', e_names):grep('Salaries', e_names)])
    names(e) <- e_names
    kable(head(e$Salaries), 'rst')
    
    ===  ==============  ===========
     Rk  Player          Salary     
    ===  ==============  ===========
      1  Derrick Rose    $18,862,875
      2  Carlos Boozer   $13,550,000
      3  Joakim Noah     $12,200,000
      4  Taj Gibson      $8,000,000 
      5  Pau Gasol       $7,128,000 
      6  Nikola Mirotic  $5,305,000 
    ===  ==============  ===========
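
As an alternative to the regex above, the commented-out tables can probably also be reached by selecting the comment nodes themselves with XPath and parsing their text as HTML. The sketch below is not the method used in this answer. It runs on a small made-up HTML string rather than the live page, and the `html` string, the `Score` column, and the "Player A" row are placeholders invented for illustration.

```r
library(xml2)
library(rvest)

# Hypothetical stand-in for the downloaded page: one visible table and
# one table wrapped in an HTML comment (placeholder data, not real stats).
html <- '<html><body>
<table id="roster"><tr><th>Player</th></tr><tr><td>Player A</td></tr></table>
<div id="all_advanced">
<!--
<table id="advanced"><tr><th>Player</th><th>Score</th></tr><tr><td>Player A</td><td>1</td></tr></table>
-->
</div>
</body></html>'

page <- read_html(html)

# The commented-out table is not an element, so the CSS selector finds nothing.
n_direct <- length(html_nodes(page, "#advanced"))

# Take the text of every comment node and parse that text as HTML.
comment_text <- xml_text(xml_find_all(page, "//comment()"))
hidden <- read_html(paste(comment_text, collapse = ""))

# Now the selector from the question matches.
advanced <- html_table(html_node(hidden, "#advanced"))
print(advanced)
```

On the real page, `html` would be replaced by the URL from the question. This keeps the `#advanced` selector usable, so the table can be picked by id instead of by its position in a list.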