
Web scraping in R, page doesn't respond to request


Most of the time, I'm unable to make a request to the following website:

https://www.adondevivir.com/proyectos-etapa-pre-venta-en-construccion.html

library(rvest);library(tibble);library(httr2)

base_url <- "https://www.adondevivir.com/proyectos-etapa-pre-venta-en-construccion.html"

parsed_base_url <- base_url |> 
  read_html()  # This works sometimes and I get the underlying html

# THIS NEVER WORKS
pagina_parsed <- base_url |> 
  request() |> 
  req_user_agent(
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"
  ) |> 
  req_headers(
    Referer = "https://www.adondevivir.com/",
    Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
    `Accept-Language` = "es-419,es;q=0.6",
    `Accept-Encoding` = "gzip, deflate, br, zstd",
    `Cache-Control` = "max-age=0",
    `Sec-Ch-Ua` = '"Brave";v="125", "Chromium";v="125", "Not.A/Brand";v="24"',
    Priority = "u=0, i"
  ) |> 
  req_perform()

Why can't I make a request to the page most of the time (not to mention that it never works with httr2 and the headers provided above)? Is there a way to overcome this "problem" with httr2? Does it have to do with cookies, or with the way the page protects itself from being scraped?

I guess I could retry the request many times until it works, but then I would not learn much about why it fails.

EDIT:

# LINUX

request("https://www.adondevivir.com/proyectos-etapa-pre-venta-en-construccion.html") |> 
  req_headers(
    accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
    `accept-language` = "es-419,es;q=0.6",
    `cache-control` = "max-age=0",
    cookie = "cf_clearance=lfXt_1Txp.pA9GgXL1760HCFSoERF3gAMYfPCQjqlaU-1722847065-1.0.1.1-tuClc.30RzuH4z81zAhO1rzgl_jM0oNzmWf30to9JUXnEzkyzCcs_nzU.ZtTQSZRaN.VYnjCppRq.BfZQIZQ0A; sessionId=60a093dd-3544-4174-852b-18251d4e9a21; allowCookies=true; cookiesPreferencesUser=%257B%2522functionality%2522%253Atrue%252C%2522performance%2522%253Atrue%252C%2522traceability%2522%253Atrue%252C%2522advertising%2522%253Atrue%257D; __cfruid=8acadb99e37aa66eb23792f9cdfa89b3fee9d38c-1722977691; _cfuvid=2fgYYpdcBNjQSoNEtokGOkkkJyFkLQjoxAyRoeo1Xqk-1722977691025-0.0.1.1-604800000; __cf_bm=3TXJj5SEFzoZZcX.Swpc4nnb.YmLoK30bOfxgCfjKUk-1723008440-1.0.1.1-zxZjCEZof_blUCi.uFm7lAFb6eiGuDHazG7LaHjuHenlWJlRkHT4u2Nm8JBCt8wHLX642NJCWXaAIOy2GYCEzWIBPFwPmy9YH9JEMs1.eRs; JSESSIONID=9E3D4BEC919B49CB53DD9020DFF1CD75",
    priority = "u=0, i",
    `sec-gpc` = "1",
    `upgrade-insecure-requests` = "1",
    `user-agent` = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
  ) |> 
  req_perform() # Doesn't work


# WINDOWS 

request("https://www.adondevivir.com/proyectos-etapa-pre-venta-en-construccion.html") |> 
  req_headers(
    accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    `accept-language` = "es-419,es;q=0.9,es-ES;q=0.8,en;q=0.7,en-GB;q=0.6,en-US;q=0.5,es-PE;q=0.4",
    `cache-control` = "max-age=0",
    cookie = "_gcl_au=1.1.2060187212.1722865882; _ga=GA1.1.84685970.1722865883; sessionId=b5c67fa1-61e3-488b-bdce-b37a35360c2b; cf_clearance=nFFFwie0P4hXnQvrqhhuUN2sZYLtLGD0bU4iwnZRD1E-1722865884-1.0.1.1-aq14LXmv.ShobWDHCGKDzGn1pOpQznexx3JaetuO2Feik0F6fgL62Aa93YTUHaJ6MQg9GWBWNR0oZhnpk6Ov5Q; _hjSessionUser_212841=eyJpZCI6Ijc5NWZjYzA2LTc3MmItNThkMi1iZjM1LTRlMWJkZDAwNDA3YyIsImNyZWF0ZWQiOjE3MjI4NjU4ODg4MzksImV4aXN0aW5nIjp0cnVlfQ==; allowCookies=true; cookiesPreferencesUser=%257B%2522functionality%2522%253Atrue%252C%2522performance%2522%253Atrue%252C%2522traceability%2522%253Atrue%252C%2522advertising%2522%253Atrue%257D; __cf_bm=hLb.MuJ3t.F0Hwyk.BFvq9rdFZbPBGHqZWPfbNaN8x0-1723009061-1.0.1.1-2tr2Tg9M7tQqIrAoeqvtUIu25PZjCP37j2oAFNxXcwKSM.EWOJOJshM7GAN.a8f7dfIhEbZyLg_K2ntKqOxBsjkNLjUZ8AVRxy.KcPXcffs; __cfruid=7b233df105ab2e2047d76c945a6279b9327006c0-1723009061; _cfuvid=ACiINLrLTnj1AdfclBPSh72Li.Vljdva5riRLgu.aeI-1723009061971-0.0.1.1-604800000; _ga_2CDWC2XXVB=GS1.1.1723009064.8.0.1723009064.60.0.0; JSESSIONID=5FB571AEFECEE31A66970297E7A9CDD1; __gads=ID=1720f8ecfd9469e8:T=1722866024:RT=1723009064:S=ALNI_MYvcna1rXM7g_umHwCOx9Q7xwvJag; __gpi=UID=00000a4b543eb955:T=1722866024:RT=1723009064:S=ALNI_MYVYTTN6gDbpewSXyheQu6gYHpyEQ; __eoi=ID=beb4c0e62cecb05c:T=1722866024:RT=1723009064:S=AA-AfjY1R2Nq48cVs_hAMX2XOghF; _hjSession_212841=eyJpZCI6Ijc4OWUwZDYxLWFhZjItNDQ0OS04YzYzLTk0ZjE5ZDVhNGE3MyIsImMiOjE3MjMwMDkwNjcxNTgsInMiOjAsInIiOjAsInNiIjowLCJzciI6MCwic2UiOjAsImZzIjowLCJzcCI6MH0=",
    priority = "u=0, i",
    `upgrade-insecure-requests` = "1",
    `user-agent` = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36 Edg/127.0.0.0",
  ) |> 
  req_perform() # WORKS (And Yes, I copied this from my Windows machine to my Linux machine and it worked)

Solution

  • Does it have to do with cookies, or with the way the page protects itself from being scraped?

    Both. The site is protected by Cloudflare, which runs a series of checks designed to trip up automated tools and headless browsers. If you open the site in a fresh session or in your browser's Incognito mode, you should see the Cloudflare JavaScript challenge in action. If you also have the Network tab of DevTools open (with "Preserve log" enabled, and perhaps throttling to slow things down a bit), you should see further hints about what gets probed. If Cloudflare decides your request is legitimate, it sets cookies that grant passage for the rest of your session. You can reuse those cookies with httr2.
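
To tell a Cloudflare challenge apart from an ordinary failure, one option is to look at the response headers. The sketch below assumes that challenge responses carry a `cf-mitigated: challenge` header, which is how Cloudflare's documentation describes them; it has not been verified against this particular site. The helper name `is_cloudflare_challenge()` is hypothetical, and it takes plain R values so that it does not depend on any package.

```r
# Heuristic: does this set of response headers look like a Cloudflare challenge?
# `headers` is a named list (or named character vector) of response headers.
# Assumption: challenge pages are marked with `cf-mitigated: challenge`.
is_cloudflare_challenge <- function(headers) {
  headers <- as.list(headers)
  # HTTP header names are case-insensitive, so normalise them first.
  names(headers) <- tolower(names(headers))
  mitigated <- headers[["cf-mitigated"]]
  !is.null(mitigated) && identical(tolower(mitigated), "challenge")
}

is_cloudflare_challenge(list(`cf-mitigated` = "challenge"))
is_cloudflare_challenge(list(`content-type` = "text/html"))
```

With httr2, the headers would come from `resp_headers(resp)`. Because `req_perform()` raises an error on 4xx and 5xx statuses by default, inspecting a blocked response this way would also need something like `req_error(is_error = \(resp) FALSE)` on the request.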

    A semi-manual approach might go something like this:

    • open the page in your browser and get through the Cloudflare challenge
    • navigate to the same URL again with DevTools open
    • copy the request as cURL (right-click the request in the Network tab of DevTools)
    • pass it to httr2::curl_translate()
    library(rvest)
    library(httr2)
    
    # translate curl to httr2:
    curl_translate(r"(curl 'https://www.adondevivir.com/proyectos-etapa-pre-venta-en-construccion.html' \
      -H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8' \
      -H 'accept-language: en-GB,en;q=0.9' \
      -H 'cookie: __cf_bm=MbgHRAOsR8nrNwVtUAXzb0HMRrBSK4hbiiNtO3mg41A-1722860033-1.0.1.1-WvLWPer9d5s3PyuFjRZeCwyIRmiyELE5bs40JWH4Txc4OZXWvFUaSgqUbvbf_xYKnpePSYWv5GY4btcLBR_vvc6pt01F1sLPXt0QVrPolJk; sessionId=b5a7a506-4689-4188-88e7-fe005fc154ab; cf_clearance=vchooYPfzwbG.fmcao_YheRm7DPILRmr8xhNSWLUabQ-1722860102-1.0.1.1-tYQZHx6sTEzzRUS.UHz1rjZNq1a1VcSoOcn7l0EjRqbFeHNUwzsHyhTjW2R0RU_Tnv.6L5WbxxEy8m3xxcepaw' \
      -H 'priority: u=0, i' \
      -H 'sec-ch-ua: "Not)A;Brand";v="99", "Brave";v="127", "Chromium";v="127"' \
      -H 'sec-ch-ua-mobile: ?0' \
      -H 'sec-ch-ua-model: ""' \
      -H 'sec-ch-ua-platform: "Windows"' \
      -H 'sec-ch-ua-platform-version: "15.0.0"' \
      -H 'sec-fetch-dest: document' \
      -H 'sec-fetch-mode: navigate' \
      -H 'sec-fetch-site: none' \
      -H 'sec-fetch-user: ?1' \
      -H 'sec-gpc: 1' \
      -H 'upgrade-insecure-requests: 1' \
      -H 'user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36')")
    #> request("https://www.adondevivir.com/proyectos-etapa-pre-venta-en-construccion.html") |> 
    #>   req_headers(
    #>     accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
    #>     `accept-language` = "en-GB,en;q=0.9",
    #>     cookie = "__cf_bm=MbgHRAOsR8nrNwVtUAXzb0HMRrBSK4hbiiNtO3mg41A-1722860033-1.0.1.1-WvLWPer9d5s3PyuFjRZeCwyIRmiyELE5bs40JWH4Txc4OZXWvFUaSgqUbvbf_xYKnpePSYWv5GY4btcLBR_vvc6pt01F1sLPXt0QVrPolJk; sessionId=b5a7a506-4689-4188-88e7-fe005fc154ab; cf_clearance=vchooYPfzwbG.fmcao_YheRm7DPILRmr8xhNSWLUabQ-1722860102-1.0.1.1-tYQZHx6sTEzzRUS.UHz1rjZNq1a1VcSoOcn7l0EjRqbFeHNUwzsHyhTjW2R0RU_Tnv.6L5WbxxEy8m3xxcepaw",
    #>     priority = "u=0, i",
    #>     `sec-gpc` = "1",
    #>     `upgrade-insecure-requests` = "1",
    #>     `user-agent` = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36",
    #>   ) |> 
    #>   req_perform()
    
    # make request, parse html response 
    request("https://www.adondevivir.com/proyectos-etapa-pre-venta-en-construccion.html") |> 
      req_headers(
        accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
        `accept-language` = "en-GB,en;q=0.9",
        cookie = "__cf_bm=MbgHRAOsR8nrNwVtUAXzb0HMRrBSK4hbiiNtO3mg41A-1722860033-1.0.1.1-WvLWPer9d5s3PyuFjRZeCwyIRmiyELE5bs40JWH4Txc4OZXWvFUaSgqUbvbf_xYKnpePSYWv5GY4btcLBR_vvc6pt01F1sLPXt0QVrPolJk; sessionId=b5a7a506-4689-4188-88e7-fe005fc154ab; cf_clearance=vchooYPfzwbG.fmcao_YheRm7DPILRmr8xhNSWLUabQ-1722860102-1.0.1.1-tYQZHx6sTEzzRUS.UHz1rjZNq1a1VcSoOcn7l0EjRqbFeHNUwzsHyhTjW2R0RU_Tnv.6L5WbxxEy8m3xxcepaw",
        priority = "u=0, i",
        `sec-gpc` = "1",
        `upgrade-insecure-requests` = "1",
        `user-agent` = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36",
      ) |> 
      req_perform() |> 
      resp_body_html() |> 
      html_elements("h3[data-qa='POSTING_CARD_DESCRIPTION']") |> 
      html_text() |> 
      head() |> 
      stringr::str_trunc(80)
    #> [1] "Vive en el distrito patriota de Lima, hogar de libertadores. Disfruta de una ..."  
    #> [2] "Obra en curso - 47% vendido! Un proyecto inigualable ubicado en la zona monum..."  
    #> [3] "¡Vive frente al Campo de Marte en Jesús María! Presentamos \"Salaverry 571\", u..."
    #> [4] "¡Vive en la mejor zona de Surquillo! Lobby, Sala de niños, Sala de Usos Multi..."  
    #> [5] "Grupo Lar, única inmobiliaria en Perú, con presencia en 5 países en simultáne..."  
    #> [6] "Proyecto exclusivo en la Nueva Santa Catalina, a 5 minutos del centro financi..."
    

    Created on 2024-08-05 with reprex v2.1.0
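
A cookie string copied from a browser often includes analytics and advertising cookies as well (for example `_ga` and `__gads` in the question's Windows request), while the request shown above carried only `__cf_bm`, `sessionId`, and `cf_clearance`. The following is a minimal base-R sketch for trimming a copied `Cookie` header down to a chosen set of names. The helpers `parse_cookie_header()` and `keep_cookies()` and the sample values are hypothetical, and whether those three cookies are always sufficient for this site is an assumption.

```r
# Split a raw Cookie header into a named character vector.
# Values may contain "=" themselves (e.g. base64 padding), so split on the
# first "=" of each pair only.
parse_cookie_header <- function(cookie_header) {
  pairs <- trimws(strsplit(cookie_header, ";", fixed = TRUE)[[1]])
  pairs <- pairs[nzchar(pairs)]
  eq <- regexpr("=", pairs, fixed = TRUE)
  values <- substring(pairs, eq + 1)
  names(values) <- substring(pairs, 1, eq - 1)
  values
}

# Keep only the cookies named in `keep` and rebuild the header string.
keep_cookies <- function(cookie_header, keep) {
  cookies <- parse_cookie_header(cookie_header)
  cookies <- cookies[names(cookies) %in% keep]
  if (length(cookies) == 0) return("")
  paste0(names(cookies), "=", cookies, collapse = "; ")
}

# Made-up cookie string for illustration; not real session data.
raw_cookie <- "_ga=GA1.1.123; sessionId=abc-123; cf_clearance=xyz.456; tracker=eyJpZCI6MX0=="
slim_cookie <- keep_cookies(raw_cookie, c("cf_clearance", "__cf_bm", "sessionId"))
slim_cookie
```

The trimmed string can then be passed as the `cookie` value in `req_headers()`. Reading the raw string from an environment variable with `Sys.getenv()` instead of pasting it into the script keeps session cookies out of shared code.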