Google search links obtained by web scraping in R are not in the required format


I am new to web scraping in R and am trying to run a Google search for a search term from R and extract the result links automatically. I have partially succeeded in obtaining the links from the Google search results using the RCurl and XML packages. However, the href values I extract include unwanted information and are not in the form of a plain URL.

The code I use is:

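library(RCurl)
library(XML)
html <- getURL(u)                      # u holds the Google search URL
doc <- htmlParse(html, asText = TRUE)  # parse the downloaded page so XPath can be applied
links <- xpathApply(doc, "//h3//a[@href]", xmlGetAttr, 'href')
links <- grep("http://", links, fixed = TRUE, value=TRUE)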

The above code gives me seven links; however, they are in the following format:

[1] "/url?q=http://theguitarrepairworkshop.com/services/&sa=U&ved=0ahUKEwiOnNXzsr7OAhWHAMAKHX_LApYQFggmMAM&usg=AFQjCNF1r13FMHXXTsxMkbwzortiWKDALQ" 

I would prefer them to be:

http://theguitarrepairworkshop.com/services/

How do I extract the href as above?


Solution

  • Using the rvest package (which builds on the xml2 package and has a lot of handy features for scraping):

    library(rvest)
    ht <- read_html('https://www.google.co.in/search?q=guitar+repair+workshop')
    links <- ht %>% html_nodes(xpath='//h3/a') %>% html_attr('href')
    gsub('/url\\?q=','',sapply(strsplit(links[as.vector(grep('url',links))],split='&'),'[',1))
    

    Output:

    [1] "http://theguitarrepairworkshop.com/"                                                                   
    [2] "http://www.justdial.com/Delhi-NCR/Guitar-Repair-Services/ct-134788"                                    
    [3] "http://www.guitarrepairshop.com/"                                                                      
    [4] "http://www.guitarworkshoponline.com/"                                                                  
    [5] "http://www.guitarrepairbench.com/guitar-building-projects/guitar-workshop/guitar-workshop-project.html"
    [6] "http://www.guitarservices.com/"                                                                        
    [7] "http://guitarworkshopglasgow.com/pages/repairs-1"                                                      
    [8] "http://brightonguitarworkshop.co.uk/"                                                                  
    [9] "http://www.luth.org/resources/schools.html"   
    

    The fourth line of the code cleans the text. It keeps only the hrefs that contain 'url', splits each one (which comes with extra query parameters attached) on '&', takes the first element of each split, and then replaces '/url?q=' with an empty string.

    Hope it helps!
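
The cleaning step described above can also be written as a small standalone function, which may be easier to reuse and test. This is only a sketch under a few assumptions: the helper name `clean_google_links` is hypothetical, it anchors the match on the `/url?q=` prefix instead of matching any href that contains 'url', and it adds a percent-decoding step with `utils::URLdecode` in case the target URL is encoded. Neither of the last two is part of the answer above.

```r
# Hypothetical helper (not from the original answer): extract the target URL
# from Google redirect hrefs of the form "/url?q=<target>&sa=...".
clean_google_links <- function(hrefs) {
  # Keep only redirect-style hrefs; anything else is dropped.
  redirects <- hrefs[grepl("^/url\\?q=", hrefs)]
  # Strip the "/url?q=" prefix, then everything from the first "&" onward.
  targets <- sub("&.*$", "", sub("^/url\\?q=", "", redirects))
  # Assumption: the target may be percent-encoded, so decode it.
  vapply(targets, utils::URLdecode, character(1), USE.NAMES = FALSE)
}

hrefs <- c(
  "/url?q=http://theguitarrepairworkshop.com/services/&sa=U&ved=0ahUKEwiOnNXzsr7OAhWHAMAKHX_LApYQFggmMAM&usg=AFQjCNF1r13FMHXXTsxMkbwzortiWKDALQ",
  "/search?q=guitar+repair+workshop"
)
clean_google_links(hrefs)
```

With rvest, the function could be applied directly to the `links` vector returned by `html_attr('href')`.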