
Scraping JavaScript-rendered content using rvest


I am trying to scrape JavaScript-rendered content from a webpage. I tried the JIRA API as suggested, but it does not return the activity log. I found an article explaining how JavaScript-rendered content can be scraped:

https://datascienceplus.com/scraping-javascript-rendered-web-content-using-r/

I followed the example, but I do not understand what XPath expression I need to pass to get the activity log listed. The activity log I want to scrape is under the all-tab container at the bottom of the webpage.

library(rvest)
library(V8)

# URL with js-rendered content to be scraped
link <- 'https://issues.apache.org/jira/browse/AMQCPP-645'

# Read the html page content and extract all javascript codes that are inside a list
# html <- getURL(link, followlocation = TRUE)
emailjs <- read_html(link) %>% html_nodes(xpath = "//div") %>% html_text()

ct <- v8()
# Parse the html content from the js output and print it as text
read_html(ct$eval(gsub('document.write', '', emailjs))) %>%
  html_text()

I was hoping to get output like this:

    rows  emailjs
    1     S A created issue - 25/Apr/19 15:48 Highlight in document.
    2     Justin Bertram made changes - 25/Apr/19 17:53 Field Original Value New Value Comment [ I'm using Firefox, and it's working no problem. It's just HTML so there shouldn't be any browser compatibility issues. My guess is that Firefox is holding on to an older, cached version or something. Try opening a "private browsing" window and trying it from there. ] Highlight in document.
    3     Timothy Bish made changes - 25/Apr/19 18:10 Resolution Fixed [ 1 ] Status Open [ 1 ] Closed [ 6 ] Highlight in document.
    4     Timothy Bish made transition - 25/Apr/19 18:10 Open Closed 2h 22m 1

Suggestions would be greatly appreciated. Thank you!


Solution

  • You can mimic the POST request the page makes, adding the one header it requires (X-Requested-With), and then parse the HTML response for the desired content. You may need to do a little more string tidying.

    library(httr)
    library(rvest)
    library(magrittr)
    
    # The one required header: marks the request as an XMLHttpRequest
    headers = c('X-Requested-With' = 'XMLHttpRequest')
    
    data = '[{"name":"jira.viewissue.tab.clicked","properties":{"inNewWindow":false,"keyboard":false,"context":"unknown","tab":"com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel","tabPosition":1},"timeDelta":-4904},{"name":"jira.viewissue.tab.clicked","properties":{"inNewWindow":false,"keyboard":false,"context":"unknown","tab":"com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel","tabPosition":0},"timeDelta":-4178}]'
    
    # POST to the all-tab panel URL, then parse the returned HTML
    response <- httr::POST(url = 'https://issues.apache.org/jira/browse/AMQCPP-645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel&_=1570029676497', httr::add_headers(.headers = headers), body = data)
    
    rows <- read_html(response) %>%
            html_nodes('.issuePanelWrapper .issue-data-block') %>%
            html_text() %>%
            gsub('\\s+|\n+', ' ', .)
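
To get from the extracted text to the two-column layout asked for in the question, something like the following sketch may help. It runs offline against `sample_html`, a made-up and heavily simplified stand-in for the markup the POST returns (the real response has far more nesting), and `activity_table` is a hypothetical helper name, not part of rvest. The column names `rows` and `emailjs` are taken from the desired output in the question.

```r
library(rvest)

# Hypothetical, simplified stand-in for the HTML returned by the POST request.
# Only the two class names used by the selector are taken from the answer above.
sample_html <- '
<div class="issuePanelWrapper">
  <div class="issue-data-block">
    S A created issue -
       25/Apr/19 15:48
  </div>
  <div class="issue-data-block">
    Timothy Bish made transition -
       25/Apr/19 18:10   Open   Closed
  </div>
</div>'

# Hypothetical helper: pull out each activity entry and return one row per entry.
activity_table <- function(page) {
  entries <- page %>%
    html_nodes('.issuePanelWrapper .issue-data-block') %>%
    html_text()
  # Collapse runs of whitespace (including newlines), then trim both ends.
  entries <- trimws(gsub('\\s+', ' ', entries))
  data.frame(rows = seq_along(entries), emailjs = entries,
             stringsAsFactors = FALSE)
}

activity <- activity_table(read_html(sample_html))
print(activity)
```

Against the live page, the parsed POST response from the answer would take the place of `read_html(sample_html)`. Any further tidying, such as stripping the trailing "Highlight in document." text, would be an extra `gsub` step on `entries`.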