I am trying to extract data from the following site using R:
https://www.zomato.com/ncr/restaurants/north-indian
I am a beginner in this field. This is what I tried first:
> library(XML)
> doc<-htmlParse("the url mentioned above")
> Warning message:
> XML content does not seem to be XML: 'https://www.zomato.com/ncr/restaurants/north-indian'
That was my first attempt. I also tried readLines(), which gave the following output:
> readLines("the URL mentioned above") [I can't post more than two links, so I'm typing this instead]
> Error in file(con, "r") : cannot open the connection
> In addition: Warning message:
> In file(con, "r") : unsupported URL scheme
I understand from the error that the page is not XML, but what other way is there to capture the data from this site? I tried using HTML Tidy to convert the page to XML or XHTML and work from there, but I got nowhere; maybe I don't yet know how to use HTML Tidy properly. Can you suggest how to solve this, and point out any corrections?
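The rvest package is also pretty handy (it is built on top of xml2 in current releases and on the XML package in older ones, amongst other packages):
library(rvest)
# read_html() replaces the html() function of older rvest versions
pg <- read_html("https://www.zomato.com/ncr/restaurants/north-indian")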
# extract all the restaurant names
pg %>% html_nodes("a.result-title") %>% html_text()
## [1] "Bukhara - ITC Maurya " "Karim's "
## [3] "Gulati " "Dhaba By Claridges "
## ...
## [27] "Dum-Pukht - ITC Maurya " "Maal Gaadi "
## [29] "Sahib Sindh Sultan " "My Bar & Restaurant "
# extract the ratings
pg %>% html_nodes("div.rating-div") %>% html_text() %>% gsub("[[:space:]]", "", .)
## [1] "4.3" "4.1" "4.2" "3.9" "3.8" "4.1" "4.1" "3.4" "4.1" "4.3" "4.2" "4.2" "3.9" "3.8" "3.8" "3.4" "4.0" "3.7" "4.1"
## [20] "4.0" "3.8" "3.8" "3.9" "3.8" "4.0" "4.0" "4.7" "3.8" "3.8" "3.4"
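If you want the names and ratings side by side, one option is to select each restaurant's container first and then pull both fields out of it, so the two columns stay aligned even if a rating is missing. The following is only a sketch: the inline `snippet` and the `div.result` container class are made up for illustration (only `a.result-title` and `div.rating-div` come from the code above), and the live page's markup may differ or may have changed.

```r
library(rvest)

# Hypothetical markup that mimics the selectors used above; "div.result"
# is an assumed container class, not taken from the real page.
snippet <- '
<div class="result">
  <a class="result-title">Gulati </a>
  <div class="rating-div"> 4.2 </div>
</div>
<div class="result">
  <a class="result-title">Maal Gaadi </a>
  <div class="rating-div"> 3.8 </div>
</div>'

pg <- read_html(snippet)

# One node per restaurant
cards <- html_nodes(pg, "div.result")

# html_node() returns exactly one match (or a missing node) per card,
# so names and ratings cannot drift out of step with each other.
restaurants <- data.frame(
  name   = trimws(html_text(html_node(cards, "a.result-title"))),
  rating = as.numeric(trimws(html_text(html_node(cards, "div.rating-div")))),
  stringsAsFactors = FALSE
)

print(restaurants)
```

With the real page you would replace `snippet` with the URL and `div.result` with whatever element actually wraps each listing; `trimws()` also removes the trailing spaces visible in the names above.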