
Rvest: Web Scraping Japanese Baseball Website


I am trying to scrape two tables from the npb.jp website using the rvest package in R. I have tried CSS selectors for both tables, but to no avail. Could the issue lie in the format of the webpage?

Code:

library(rvest)

html <- read_html("https://npb.jp/bis/eng/2022/stats/std_c.html")
css <- "#stdivmaintbl > table > tbody > tr > td > div:nth-child(1)"
nodes <- html_nodes(html, css)
table <- html_table(nodes)[[1]]

df <- data.frame(table)

The code reads in the HTML but cannot seem to find the table.

I'd appreciate any assistance.


Solution

  • When I tried to read the URL directly, I got a certificate error, so I copied the page's source HTML into a local file and read that instead. I'm assuming the file's contents are the same as what you get from the internet. Selecting the table by its class, .stdtblmain, worked for me:

    library(rvest)
    library(magrittr)
    
    
    # This is where I saved the page's HTML.
    # If you don't have the same certificate problem I had, you could use this instead:
    # url <- "https://npb.jp/bis/eng/2022/stats/std_c.html"
    url <- "baseball.html"
    
    table <- url %>%
      read_html() %>%
      html_nodes(".stdtblmain") %>%
      html_table()
    
    table[[1]]
    
    > table[[1]]
    # A tibble: 27 × 239
       X1        X2    X3    X4    X5    X6    X7    X8    X9    X10   X11   X12   X13   X14   X15   X16   X17   X18   X19   X20  
       <chr>     <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
     1 "TeamGWL… "Tea… G     W     L     T     PCT   "GB"  ""    Home  Road  ""    "vsS" vsDB  vsT   vsG   vsC   vsD   Int   Toky…
     2 "Team"    "G"   W     L     T     PCT   GB    ""    ""    Home  Road  ""    ""    vsS   vsDB  vsT   vsG   vsC   vsD   Int  
     3 "Tokyo Y… ""    Toky… 143   80    59    4     ""    ""    .576  --    ""    ""    37-34 43-2… ***   16-9  13-1… 11-1… 16-8…
     4 ""        "Tok… NA    NA    NA    NA    NA    ""    ""    NA    NA    ""    ""    NA    NA    NA    NA    NA    NA    NA   
     5 "YOKOHAM… ""    YOKO… 143   73    68    2     ""    ""    .518  8.0   ""    ""    41-3… 32-3… 9-16  ***   16-9  13-1… 8-17 
     6 ""        "YOK… NA    NA    NA    NA    NA    ""    ""    NA    NA    ""    ""    NA    NA    NA    NA    NA    NA    NA   
     7 "Hanshin… ""    Hans… 143   68    71    4     ""    ""    .489  12.0  ""    ""    37-3… 31-3… 11-1… 9-16  ***   14-1… 9-14…
     8 ""        "Han… NA    NA    NA    NA    NA    ""     NA   NA    NA     NA   ""    NA    NA    NA    NA    NA    NA    NA   
     9 "Yomiuri… ""    Yomi… 143   68    72    3     ".48… "12.… 35-3… 33-3… "13-… "11-… 10-1… ***   13-12 13-12 8-10  NA    NA   
    10 ""        "Yom… NA    NA    NA    NA    NA     NA    NA   NA    NA     NA    NA   NA    NA    NA    NA    NA    NA    NA   
    # … with 17 more rows, and 219 more variables: X21 <chr>, X22 <chr>,
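
One possible reason the original selector finds nothing, which I have not confirmed against the npb.jp page itself: selectors copied from a browser's inspector often include `tbody`, because browsers insert that element into the DOM even when the raw HTML source omits it, while `read_html()` parses the source as written. The sketch below uses a small made-up HTML snippet (the `wrapper` id, the `standings` class, and the team rows are hypothetical stand-ins, not the real page) to show the difference between a `tbody` path and a class selector.

```r
library(rvest)

# Hypothetical stand-in for the page: a table with a class, written
# without an explicit <tbody>, as raw HTML source often is.
page <- read_html('
  <div id="wrapper">
    <table class="standings">
      <tr><th>Team</th><th>W</th><th>L</th></tr>
      <tr><td>A</td><td>80</td><td>59</td></tr>
      <tr><td>B</td><td>73</td><td>68</td></tr>
    </table>
  </div>
')

# A browser-style path that assumes <tbody> sits between table and tr
via_tbody <- html_nodes(page, "#wrapper > table > tbody > tr")
length(via_tbody)

# Selecting the table by its class does not depend on <tbody>
standings <- page %>%
  html_node("table.standings") %>%
  html_table()
standings
```

If the first selector comes back empty while the second returns the table, dropping `tbody` from the path, or selecting by class as in the answer above, is likely the simpler fix. The wide, repetitive output in the answer suggests `.stdtblmain` may wrap nested tables, so the result may still need cleaning afterwards.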