I am collecting some information from the web and recently found a page that is difficult to scrape. It shows a list of doctors, but unlike other sites, where the page index appears in the URL as you navigate, here the index never shows up in the address, even when I move to the last page. The page is the following:
I am trying to get the list of all doctors on it. The catch is that the links to all the pages are at the bottom of the list:
But moving to page 6 does not change the URL, so I cannot set up a loop or other function to get all the data.
In my attempt to do this I sketched the following code:
library(dplyr)
library(rvest)
#Code
page <- 'https://www.doctorisy.com/guatemala/medicos/'
#Read
html1 <- read_html(page)
#Get data
data <- html1 %>%
  html_nodes("[class='d-flex row mx-0 cards-doctors mat-card mt-2']") %>%
  html_text()
But it only gets the data from the first page, and even that is incomplete: the first element of data belongs to the second name in the list:
data[[1]]
[1] "Neurocirujano Osberto Octavio De León López Edificio Sixtino I|6ta. Avenida 6-63 zona 10Teleconsulta|Online Q 300 • Efectivo, Transferencia 94 pacientes lo recomiendan \"\" Seguros: +4 Neurocirujano Osberto Octavio De León López Edificio Sixtino ...|6ta. Avenida 6-63 zona 10 Teleconsulta |Online Q 300 • Efectivo, Transferencia 94 pacientes lo recomiendan \"\" Seguros: +4"
I would appreciate any suggestion on how to get the data from all the pages (6 pages). If possible, I would also like each element of data formatted as a data frame. For example, data[[1]]
should become:
Var1 Var2 Var3....
Neurocirujano Osberto Octavio De León López Edificio Sixtino I|6ta. Avenida 6-63 zona 10...
The same should apply to all the elements of data,
each converted to a data frame.
Many thanks.
Let's use a different method to scrape this site. The page loads its data from an API that is not protected, so it is much easier to query that API directly, and it costs them less bandwidth too because you only load the data.
Fair warning: their API may change.
You can use the Network tab of your browser's Inspector tool to see which resources are being loaded. Look for the request whose response contains the data we want in JSON format.
From the Inspector tool you can get the request URL and all of the expected headers, including the API key, which they kindly left available for us.
Because you only asked for a list of doctors, that is all the code below does. I will leave it up to you to find the appointment API and interface with it if you need that as well.
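For orientation, here is a minimal sketch of how a JSON response of this kind parses in R. The payload is a made-up two-record example, not real output. The `@odata.count` and `value` field names are assumptions based on the usual shape of Azure Search responses; the records themselves are invented for illustration.

```r
library(jsonlite)

# Made-up payload that mimics the assumed response shape:
# a total count plus a "value" array with one object per doctor.
sample_json <- '{
  "@odata.count": 2,
  "value": [
    {"firstName": "Ana",  "firstSurName": "Perez", "lang": [{"name": "es"}, {"name": "en"}]},
    {"firstName": "Luis", "firstSurName": "Gomez", "lang": [{"name": "es"}]}
  ]
}'

parsed <- fromJSON(sample_json)

# Total number of matching records reported by the server (assumed field name)
total <- parsed[["@odata.count"]]

# fromJSON() simplifies an array of objects into a data frame;
# nested arrays such as "lang" become list columns.
doctors <- parsed$value

# Keep only the non-list columns to get a flat table
flat <- doctors[, !vapply(doctors, is.list, logical(1)), drop = FALSE]
print(flat)
```

Dropping the list columns is only one option. If the nested fields matter, they would need to be unpacked separately.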
library(httr)
library(jsonlite)
# Using Chrome Inspector we can see that this is the API link
example_url <-"https://doctorisysearchprod.search.windows.net/indexes/profiles-index/docs?api-version=2017-11-11&&search=*&$filter=search.ismatch('Guatemala', 'country', 'full', 'all')&$top=10&$skip=10&facet=&$count=true&$orderby=orderPay desc, plid desc, orderNoPay desc, aleatory asc"
# Make a function that creates a valid, URL-encoded request URL based on inputs
doctorisy_url <- function(filter1 = 'Guatemala',
                          filter2 = 'country',
                          filter3 = 'full',
                          filter4 = 'all',
                          top = 10,
                          skip = 0){
  example_url <- paste0(
    "https://doctorisysearchprod.search.windows.net/indexes/profiles-index/docs?api-version=2017-11-11&",
    "&search=*&$filter=search.ismatch('",filter1,"', '",filter2,"', '",filter3,"', '",filter4,"')",
    "&$top=",top,"&$skip=",skip,
    "&facet=&$count=true&$orderby=orderPay desc, plid desc, orderNoPay desc, aleatory asc"
  )
  # Return the encoded URL visibly (ending on an assignment would return it invisibly)
  URLencode(example_url, repeated = TRUE)
}
# Make a custom GET function that adds the headers that the site expects to receive so we don't get a 404 or 400 error.
doctorisy_GET <- function(url,...){
  GET(url,
      add_headers(# Override default headers
        authority = "doctorisysearchprod.search.windows.net",
        # method = "GET",
        # path = "/indexes/profiles-index/docs?api-version=2017-11-11&&search=*&$filter=search.ismatch(%27Guatemala%27,%20%27country%27,%20%27full%27,%20%27all%27)&$top=10&$skip=10&facet=&$count=true&$orderby=orderPay%20desc,%20plid%20desc,%20orderNoPay%20desc,%20aleatory%20asc",
        # scheme = "https",
        accept = "application/json, text/plain, */*",
        `accept-encoding` = "gzip, deflate, br",
        `accept-language` = "en-US,en;q=0.9",
        `api-key` = "A7A0E69A1BB9C015591C62298F330840",
        application = "WEB_Hlp5E88Gpj",
        origin = "https://www.doctorisy.com",
        referer = "https://www.doctorisy.com/",
        # sec-ch-ua: "Google Chrome";v="95", "Chromium";v="95", ";Not A Brand";v="99"
        # sec-ch-ua-mobile: ?0
        # sec-ch-ua-platform: "Windows"
        # sec-fetch-dest: empty
        # sec-fetch-mode: cors
        # sec-fetch-site: cross-site
        # time-zone: America/New_York
        `user-agent` = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36"
      ),
      ...)
}
# Pull the data from the API starting on page 1
page <- doctorisy_GET(doctorisy_url(skip = 0))
page_json <- jsonlite::fromJSON(rawToChar(page$content))
page_json$value$firstSurName
# [1] "Berganza" "De León" "Barreda" "Pozuelos" "Morales" "Pacheco" "Mayén" "Claverie" "Osorio" "Vitale"
# Pull the data from the API starting on page 2 (assuming we are getting top 10)
page <- doctorisy_GET(doctorisy_url(skip = 10))
page_json <- jsonlite::fromJSON(rawToChar(page$content))
page_json$value$firstSurName
# [1] "Quijada" "Astorga" "Asturias" "Arévalo" "López" "Piche" "Estrada" "Castellanos" "Liuti"
# [10] "Cabrera"
# Change the default top 10 to top 100 and start at 0.
# WARNING: The API will cap top at some unknown value, so going over top = 100 is risky.
page <- doctorisy_GET(doctorisy_url(top = 100, skip = 0))
page_json <- jsonlite::fromJSON(rawToChar(page$content))
page_json$value$firstSurName
# [1] "Berganza" "De León" "Barreda" "Pozuelos" "Morales" "Pacheco" "Mayén" "Claverie" "Osorio"
# [10] "Vitale" "Quijada" "Astorga" "Asturias" "Arévalo" "López" "Piche" "Estrada" "Castellanos"
# [19] "Liuti" "Cabrera" "Ochoa " "Lengua" "Hernández" "Rodriguez" "García" "Guillén" "Valencia"
# [28] "Choc" "Turcios" "Quan" "Saucedo B" "Villatoro" "López" "Amato" "Barillas" "Mendoza"
# [37] "Portillo" "Díaz" "MacDonald" "Barrios" "Garcia" "Araujo" "Carranza" "Estrada" "Bauer"
# [46] "Mayén" "De la Cruz" "Chacón" "Contreras" "Obregón" "Matta" "Ranchos" "Torres" "Labbé"
# [55] "Andrade" "Gutiérrez" "Paredes "
# fromJSON() has already parsed the records into a data frame; extract it
df <- page_json$value
# > dplyr::glimpse(df)
# Rows: 57
# Columns: 65
# $ `@search.score` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
# $ profileId <int> 2533, 2592, 2893, 2679, 2563, 2904, 2727, 2924, 2688, 2580, 2686, 2845, 2862, 2855, 2953,~
# $ id <chr> "6835a46b-938d-400a-9acd-12a999ae2e50", "0e139036-38c8-4a1a-afcd-6eae1b3076ad", "10b381a2~
# $ firstName <chr> "Carmen", "Osberto", "Luis", "Julio", "Bernardo", "Bryan", "Patricia", "Carlos", "Miriam"~
# $ middleName <chr> "Amalia", "Octavio", "E", "Luis", "René", "Alexander", "Judith del Rosario", "Guillermo",~
# $ firstSurName <chr> "Berganza", "De León", "Barreda", "Pozuelos", "Morales", "Pacheco", "Mayén", "Claverie", ~
# $ secondSurName <chr> "De koninck", "López", "Matta", "López", "Ortiz", "Ureta", "De Villegas", "Martínez", "Ma~
# $ gender <chr> "female", "male", "male", "male", "male", "male", "female", "male", "female", "male", "fe~
# $ url <chr> "[\"https://blobdoctorisyprdo.blob.core.windows.net/profiles/ae1fb518-6ed8-4d6d-b187-fbaf~
# $ bio <chr> "Medico Especialista Oftalmología, Examen de Ojos de adultos y niños. Cirugia de Catarata~
# $ keywords <chr> "[]", "[]", "[]", "[]", "[]", "[]", NA, "[]", "[]", "[]", "[]", "[]", "[]", "[]", "[]", "~
# $ phoneCode <chr> "502", "502", "502", "502", "502", "502", "502", "502", "502", "502", "502", "502", "502"~
# $ phoneNumber <chr> "22697934", "57035956", "55159855", "59186012", "46504784", "30979749", "78320884", "5876~
# $ address <chr> "6a. Avenida 9-18 zona 10 Edificio Sixtino II", "Teleconsulta", "Teleconsulta", "Telecon~
# $ country <chr> "Guatemala", "Guatemala", "Guatemala", "Guatemala", "Guatemala", "Guatemala", "Guatemala"~
# $ city <chr> "Ciudad de Guatemala", "Ciudad de Guatemala", "Ciudad de Guatemala", "Ciudad de Guatemala~
# $ officeNumber <chr> "608", NA, NA, NA, NA, NA, "10", NA, NA, NA, "5A", "402", "1004", NA, "Oficina 8", "Clíni~
# $ placePhoneNumber <chr> "22697934", "57035956", "55159855", "59186012", "46504784", "30979749", "78320884", "5876~
# $ idCatalog <int> 4434, 4531, 5023, 4649, 4480, 5027, 4937, 5061, 4670, 4506, 4661, 4932, 5045, 5114, 5105,~
# $ name <chr> "Sixtino II", "Teleconsulta", "Teleconsulta", "Teleconsulta", "Teleconsulta", "Teleconsul~
# $ reference <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
# $ latitude <chr> "14.6039246016", "-0.1702710000", "-0.1702710000", "-0.1702710000", "-0.1702710000", "-0.~
# $ longitude <chr> "-90.5107120696", "-78.4700480000", "-78.4700480000", "-78.4700480000", "-78.4700480000",~
# $ isDeleted <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE~
# $ mainSpc <chr> "Oftalmología", "Neurocirugía", "Gastroenterología", "Neurocirugía", "Manejo del Dolor", ~
# $ location <df[,3]> <data.frame[43 x 3]>
# $ slug <chr> "carmen-berganza-5507", "osberto-de-leon-5567", "luis-barreda-5904", "julio-pozuelos-5~
# $ pln <chr> "PlanAnual", "PlanAnual", "PlanAnual", "PlanAnual", "PlanAnual", "PlanAnual", "PlanAnual"~
# $ plid <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2,~
# $ aleatory <int> 255, 887, 325, 622, 124, 449, 666, 704, 769, 127, 714, 12, 405, 286, 173, 555, 589, 688, ~
# $ determinante <chr> "2021-11-19T04:46:02.483Z", "2021-11-19T04:46:02.483Z", "2021-11-19T04:46:02.483Z", "2021~
# $ n1n2 <chr> "Carmen Amalia", "Osberto Octavio", "Luis E", "Julio Luis", "Bernardo René", "Bryan Alexa~
# $ entireName <chr> "Carmen Berganza", "Osberto De León", "Luis Barreda", "Julio Pozuelos", "Bernardo Morales~
# $ n1a2 <chr> "Carmen De koninck", "Osberto López", "Luis Matta", "Julio López", "Bernardo Ortiz", "Bry~
# $ n2n1 <chr> "Amalia Carmen", "Octavio Osberto", "E Luis", "Luis Julio", "René Bernardo", "Alexander B~
# $ n2a1 <chr> "Amalia Berganza", "Octavio De León", "E Barreda", "Luis Pozuelos", "René Morales", "Alex~
# $ n2a2 <chr> "Amalia De koninck", "Octavio López", "E Matta", "Luis López", "René Ortiz", "Alexander U~
# $ a1n1 <chr> "Berganza Carmen", "De León Osberto", "Barreda Luis", "Pozuelos Julio", "Morales Bernardo~
# $ a1n2 <chr> "Berganza Amalia", "De León Octavio", "Barreda E", "Pozuelos Luis", "Morales René", "Pach~
# $ a1a2 <chr> "Berganza De koninck", "De León López", "Barreda Matta", "Pozuelos López", "Morales Ortiz~
# $ a2n1 <chr> "De koninck Carmen", "López Osberto", "Matta Luis", "López Julio", "Ortiz Bernardo", "Ure~
# $ a2n2 <chr> "De koninck Amalia", "López Octavio", "Matta E", "López Luis", "Ortiz René", "Ureta Alexa~
# $ a2a1 <chr> "De koninck Berganza", "López De León", "Matta Barreda", "López Pozuelos", "Ortiz Morales~
# $ n1n2a1 <chr> "Carmen Amalia Berganza", "Osberto Octavio De León", "Luis E Barreda", "Julio Luis Pozuel~
# $ n1n2a2 <chr> "Carmen Amalia De koninck", "Osberto Octavio López", "Luis E Matta", "Julio Luis López", ~
# $ n1a1a2 <chr> "Carmen Berganza De koninck", "Osberto De León López", "Luis Barreda Matta", "Julio Pozue~
# $ n2a1a2 <chr> "Amalia Berganza De koninck", "Octavio De León López", "E Barreda Matta", "Luis Pozuelos ~
# $ likes <int> 122, 97, 96, 96, 88, 88, 74, 61, 56, 28, 22, 13, 11, 3, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, ~
# $ orderPay <int> 1100000122, 1100000097, 1100000096, 1100000096, 1100000088, 1100000088, 1100000074, 11000~
# $ orderNoPay <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
# $ isTemporal <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE~
# $ enableVideoAppointment <lgl> FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE,~
# $ enableVideoAppointmentPayments <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,~
# $ anticipatedAppointment <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "4", "1", "1", "1", "2", "1", "1", "1", "1",~
# $ anticipatedAppointmentInterval <chr> "HOUR", "HOUR", "HOUR", "HOUR", "HOUR", "HOUR", "HOUR", "HOUR", "HOUR", "HOUR", "HOUR", "~
# $ isMedicalCenter <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE~
# $ limitAppointment <chr> "7", "7", "60", "7", "12", "60", "48", "98", "7", "7", "7", "48", "7", "6", "7", "7", "99~
# $ limitAppointmentInterval <chr> "MONTH", "MONTH", "MONTH", "MONTH", "MONTH", "MONTH", "MONTH", "MONTH", "MONTH", "MONTH",~
# $ showCallWpp <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,~
# $ spc <list> [<data.frame[1 x 4]>], [<data.frame[2 x 4]>], [<data.frame[4 x 4]>], [<data.frame[2 x 4]>~
# $ plc <list> [<data.frame[1 x 16]>], [<data.frame[2 x 16]>], [<data.frame[2 x 16]>], [<data.frame[2 x~
# $ lang <list> [<data.frame[2 x 1]>], [<data.frame[2 x 1]>], [<data.frame[2 x 1]>], [<data.frame[2 x 1]~
# $ ins <list> [<data.frame[1 x 3]>], [<data.frame[6 x 3]>], [<data.frame[7 x 3]>], [<data.frame[8 x 3]~
# $ affiliate <list> [<data.frame[0 x 0]>], [<data.frame[0 x 0]>], [<data.frame[1 x 2]>], [<data.frame[0 x 0]~
# $ poll <list> [<data.frame[1 x 4]>], [<data.frame[1 x 4]>], [<data.frame[1 x 4]>], [<data.frame[1 x 4]~
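To collect every page without guessing a large `top`, the `skip` offset can be advanced in steps of `top` until a page comes back short. The sketch below is one way to do that, not the only one. `fetch_all()` and `fake_fetch()` are hypothetical helpers introduced here for illustration; `fake_fetch()` stands in for the network call, with made-up records, so the loop can be run offline.

```r
# Hypothetical helper: repeatedly calls fetch_page(top, skip) and stacks the
# returned data frames until a page has fewer than `top` rows.
fetch_all <- function(fetch_page, top = 10, max_pages = 100) {
  pages <- list()
  skip <- 0
  for (i in seq_len(max_pages)) {
    chunk <- fetch_page(top = top, skip = skip)
    if (is.null(chunk) || NROW(chunk) == 0) break  # nothing left to read
    pages[[length(pages) + 1]] <- chunk
    if (nrow(chunk) < top) break  # a short page means we reached the end
    skip <- skip + top
  }
  if (length(pages) == 0) return(data.frame())
  do.call(rbind, pages)
}

# Stand-in for the API: 57 made-up records served in slices
fake_data <- data.frame(profileId = 1:57,
                        firstSurName = paste0("Name", 1:57),
                        stringsAsFactors = FALSE)

fake_fetch <- function(top, skip) {
  if (skip >= nrow(fake_data)) return(fake_data[0, , drop = FALSE])
  idx <- seq(skip + 1, min(skip + top, nrow(fake_data)))
  fake_data[idx, , drop = FALSE]
}

all_doctors <- fetch_all(fake_fetch, top = 10)
nrow(all_doctors)
```

With the real API you would pass a function that calls `doctorisy_GET(doctorisy_url(top = top, skip = skip))`, parses the response with `fromJSON()`, and returns the `value` element. Because the real records contain list columns, the `do.call(rbind, ...)` step may need adjusting; that part is untested here.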