
Get all the Twitter links in a web page using RSelenium


I am trying to collect URLs from a webpage with RSelenium, but I am getting an InvalidSelector error.

I am using R 3.6.0 on a Windows 10 PC, RSelenium 1.7.5 with the Chrome webdriver (chromever = "75.0.3770.8").


library(RSelenium)

# start the Selenium server and open a Chrome session
rD <- rsDriver(browser = c("chrome"), chromever = "75.0.3770.8")
remDr <- remoteDriver(port = 4567L, browserName = "chrome")
remDr$open()

url <- "https://www.aph.gov.au/Senators_and_Members/Parliamentarian_Search_Results?q=&mem=1&par=1&gen=0&ps=96"
remDr$navigate(url)

# try to select the href attribute of every Twitter link
tt <- remDr$findElements(using = "xpath", "//a[contains(@href,'http://twitter.com/')]/@href")

I expect to collect the URLs of the Twitter accounts of the politicians listed. Instead I get the following error:

Selenium message:

invalid selector: The result of the xpath expression "//a[contains(@href,'http://twitter.com/')]/@href" is: [object Attr]. It should be an element.
  (Session info: chrome=75.0.3770.80)
For documentation on this error, please visit: https://www.seleniumhq.org/exceptions/invalid_selector_exception.html
Build info: version: '4.0.0-alpha-1', revision: 'd1d3728cae', time: '2019-04-24T16:15:24'
System info: host: 'ALEX-DELL-17', ip: '10.0.75.1', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '1.8.0_191'
Driver info: driver.version: unknown

Error: Summary: InvalidSelector Detail: Argument was an invalid selector (e.g. XPath/CSS). class: org.openqa.selenium.InvalidSelectorException Further Details: run errorDetails method

When I make a similar search for a very specific element, everything works fine. For example:

tt <- remDr$findElement(value = '//a[@href = "http://twitter.com/AlboMP"]')

then

tt$getElementAttribute('href') 

returns the URL I need.

What am I doing wrong?


Solution

  • I don't know anything about R, so I am posting an answer in Python. As this post is about R, I learned some R basics and am posting an R version too.

    Selenium can only return element nodes, so an XPath that ends in /@href (an attribute node) is rejected, which is what the error message says. The easiest way to get the Twitter URLs is instead to select the <a> elements, iterate through all their URLs, and check whether each one contains the word 'twitter'.

    In Python (which works fine):

    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.get('https://www.aph.gov.au/Senators_and_Members/Parliamentarian_Search_Results?q=&mem=1&par=1&gen=0&ps=96')
    # collect every anchor that has an href attribute
    links = driver.find_elements_by_xpath("//a[@href]")
    for link in links:
        if 'twitter' in link.get_attribute("href"):
            print(link.get_attribute("href"))
    

    Result:

    http://twitter.com/AlboMP http://twitter.com/SharonBirdMP
    http://twitter.com/Bowenchris http://twitter.com/tony_burke
    http://twitter.com/lindaburneymp http://twitter.com/Mark_Butler_MP
    https://twitter.com/terrimbutler http://twitter.com/AnthonyByrne_MP
    https://twitter.com/JEChalmers http://twitter.com/NickChampionMP
    https://twitter.com/LMChesters http://twitter.com/JasonClareMP
    https://twitter.com/SharonClaydon
    https://www.twitter.com/LibbyCokerMP
    https://twitter.com/JulieCollinsMP http://twitter.com/fitzhunter
    http://twitter.com/stevegeorganas https://twitter.com/andrewjgiles
    https://twitter.com/lukejgosling https://www.twitter.com/JulianHillMP http://twitter.com/stephenjonesalp https://twitter.com/gedkearney
    https://twitter.com/MikeKellyofEM http://twitter.com/mattkeogh
    http://twitter.com/PeterKhalilMP http://twitter.com/CatherineKingMP
    https://twitter.com/MadeleineMHKing https://twitter.com/ALEIGHMP
    https://twitter.com/RichardMarlesMP
    https://twitter.com/brianmitchellmp
    http://twitter.com/#!/RobMitchellMP
    http://twitter.com/ShayneNeumannMP https://twitter.com/ClareONeilMP
    http://twitter.com/JulieOwensMP
    http://www.twitter.com/GrahamPerrettMP
    http://twitter.com/tanya_plibersek http://twitter.com/AmandaRishworth http://twitter.com/MRowlandMP https://twitter.com/JoanneRyanLalor
    http://twitter.com/billshortenmp http://www.twitter.com/annewerriwa
    http://www.twitter.com/stemplemanmp
    https://twitter.com/MThistlethwaite
    http://twitter.com/MariaVamvakinou https://twitter.com/TimWattsMP
    https://twitter.com/joshwilsonmp

    In R (this may be wrong, but it gives you the idea):

    library(XML)
    library(RCurl)

    url <- "https://www.aph.gov.au/Senators_and_Members/Parliamentarian_Search_Results?q=&mem=1&par=1&gen=0&ps=96"
    doc <- getURL(url)        # download the raw HTML
    parser <- htmlParse(doc)  # parse it into a DOM tree
    # pull the href attribute from every anchor on the page
    links <- xpathSApply(parser, "//a[@href]", xmlGetAttr, "href")
    for (link in links) {
        if (grepl("twitter", link)) {
            print(link)
        }
    }
    

    I don't even know if this code will work, but the idea is to get all the URLs on the page, iterate over them, and check whether the word 'twitter' is in each one. The same idea can also be done with RSelenium itself, as sketched below. My answer is based on this
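
    For reference, here is a minimal sketch of the same filtering idea using the RSelenium session from the question, so no extra packages are needed. It assumes remDr is the open remoteDriver from the question; the "twitter" pattern passed to grepl and the variable names are only illustrative.

    # select the <a> elements themselves (not their @href attribute nodes),
    # then read the href attribute of each element afterwards
    anchors <- remDr$findElements(using = "xpath", "//a[@href]")
    hrefs <- sapply(anchors, function(a) a$getElementAttribute("href")[[1]])

    # keep only the links that point to Twitter
    twitter_links <- hrefs[grepl("twitter", hrefs)]
    print(twitter_links)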