Tags: r, web-scraping, download, gzip, xml-sitemap

403 Forbidden Error when Downloading XML.GZ File using Polite Package in R


I am trying to download a file from a URL using the polite package in R. Here is the code I am using:

    library(polite)

    # URL of the file to download
    eprice_xml_products_1 <- "https://www.eprice.it/sitemap/https/Sitemap_Elettrodomestici_1.xml.gz"

    # Create a polite session
    session <- bow(eprice_xml_products_1)

    # Download the file using the rip() function
    file_path <- rip(session, destfile = "xml_1.gz")

    print(file_path)

I have also tried this pipeline:


    bow(eprice_xml_products_1) %>%
      nod("https://www.eprice.it/sitemap/https/Sitemap_Elettrodomestici_1.xml.gz") %>%
      rip()

But I get this error:


    trying URL 'https://www.eprice.it/sitemap/https/Sitemap_Elettrodomestici_1.xml.gz'
    Error in fun(url = "https://www.eprice.it/sitemap/https/Sitemap_Elettrodomestici_1.xml.gz",  : 
      cannot open URL 'https://www.eprice.it/sitemap/https/Sitemap_Elettrodomestici_1.xml.gz'
    In addition: Warning messages:
    1: In fun(url = "https://www.eprice.it/sitemap/https/Sitemap_Elettrodomestici_1.xml.gz",  :
      downloaded length 0 != reported length 334
    2: In fun(url = "https://www.eprice.it/sitemap/https/Sitemap_Elettrodomestici_1.xml.gz",  :
      cannot open URL 'https://www.eprice.it/sitemap/https/Sitemap_Elettrodomestici_1.xml.gz': HTTP status was '403 Forbidden'

If I just open the link in my browser, the download starts immediately.

What am I missing?


Solution

  • The site blocks requests for this URL when the User-Agent value in the request headers is not that of a regular browser (Firefox, Chrome, ...). To make the download work, change your user agent to that of a browser. Below is an example that works with utils::download.file(). A similar strategy might be available for polite.

    # Set the user agent to that of a current Firefox
    options(HTTPUserAgent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/115.0")
    download.file("https://www.eprice.it/sitemap/https/Sitemap_Elettrodomestici_1.xml.gz", "Sitemap_Elettrodomestici_1.xml.gz")

    # Load XML from file
    library(xml2)
    read_xml("Sitemap_Elettrodomestici_1.xml.gz")
    #> {xml_document}
    #> <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    #>  [1] <url>\n  <loc>https://www.eprice.it/3%2De%2D4%2DPorte%2DHAIER%2DFrigorif ...
    #>  [2] <url>\n  <loc>https://www.eprice.it/3%2De%2D4%2DPorte%2DHAIER%2DFrigorif ...
    #>  [3] <url>\n  <loc>https://www.eprice.it/3%2De%2D4%2DPorte%2DHAIER%2DFrigorif ...
    #>  [4] <url>\n  <loc>https://www.eprice.it/3%2De%2D4%2DPorte%2DHAIER%2DHaier%2D ...
    #>  [5] <url>\n  <loc>https://www.eprice.it/3%2De%2D4%2DPorte%2DMIDEA%2DFrigorif ...
    #>  [6] <url>\n  <loc>https://www.eprice.it/Accessori%2DFrigoriferi%2DELECTROLUX ...
    #>  [7] <url>\n  <loc>https://www.eprice.it/accessori%2DIMPERIA/d%2D1597166</loc ...
    #>  [8] <url>\n  <loc>https://www.eprice.it/accessori%2DIMPERIA/d%2D2489361</loc ...
    #>  [9] <url>\n  <loc>https://www.eprice.it/accessori%2Dincasso%2DDe%20Longhi/d% ...
    #> [10] <url>\n  <loc>https://www.eprice.it/accessori%2Dincasso%2DELECTROLUX/d%2 ...
    #> [11] <url>\n  <loc>https://www.eprice.it/accessori%2Dincasso%2DELECTROLUX/d%2 ...
    #> [12] <url>\n  <loc>https://www.eprice.it/accessori%2Dincasso%2DELUX%20INC/d%2 ...
    #> [13] <url>\n  <loc>https://www.eprice.it/accessori%2DKENWOOD/d%2D5551714</loc ...
    #> [14] <url>\n  <loc>https://www.eprice.it/accessori%2DKENWOOD/d%2D7625838</loc ...
    #> [15] <url>\n  <loc>https://www.eprice.it/accessori%2DKitchenAid/d%2D50118434< ...
    #> [16] <url>\n  <loc>https://www.eprice.it/Accessori%2Dmacchine%2Dcaffe%2DBIA%2 ...
    #> [17] <url>\n  <loc>https://www.eprice.it/Accessori%2Dmacchine%2Dcaffe%2DBIA%2 ...
    #> [18] <url>\n  <loc>https://www.eprice.it/Accessori%2Dmacchine%2Dcaffe%2DBIA%2 ...
    #> [19] <url>\n  <loc>https://www.eprice.it/accessori%2Dmacchine%2Dcaffe%2DDE%20 ...
    #> [20] <url>\n  <loc>https://www.eprice.it/accessori%2Dmacchine%2Dcaffe%2DDE%20 ...
    #> ...
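
Note that the options() call above changes the user agent for the rest of the R session. As a minimal sketch, assuming you would rather limit the change to a single download, a small wrapper can set the option and restore the previous value afterwards. The helper name download_with_ua is hypothetical (it is not part of utils or polite), and its default user agent is simply the Firefox string from the example above.

```r
# Hypothetical helper, not part of utils or polite: download a file while
# temporarily presenting a browser-like User-Agent.
download_with_ua <- function(url, destfile,
                             user_agent = paste(
                               "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0)",
                               "Gecko/20100101 Firefox/115.0"
                             )) {
  # options() returns the previous values, so they can be restored later
  old <- options(HTTPUserAgent = user_agent)
  # Restore the original user agent on exit, even if the download fails
  on.exit(options(old), add = TRUE)

  # mode = "wb" writes the gzip bytes unchanged on every platform
  status <- utils::download.file(url, destfile, mode = "wb", quiet = TRUE)
  invisible(status)
}
```

Usage would look like download_with_ua("https://www.eprice.it/sitemap/https/Sitemap_Elettrodomestici_1.xml.gz", "Sitemap_Elettrodomestici_1.xml.gz"). For polite itself, bow() accepts a user_agent argument, so passing a browser string there before calling rip() may work as well; that variant is untested against this site.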