I am trying to download a file from a URL using the polite
package in R. Here is the code I am using:
library(polite)
# URL of the file to download
eprice_xml_products_1 <- "https://www.eprice.it/sitemap/https/Sitemap_Elettrodomestici_1.xml.gz"
# Create a polite session
session <- bow(eprice_xml_products_1)
# Download the file using rip function
file_path <- rip(session, destfile = "xml_1.gz")
print(file_path)
I have also tried with this function:
bow(eprice_xml_products_1) %>%
  nod("https://www.eprice.it/sitemap/https/Sitemap_Elettrodomestici_1.xml.gz") %>%
  rip()
But I get this error:
trying URL 'https://www.eprice.it/sitemap/https/Sitemap_Elettrodomestici_1.xml.gz'
Error in fun(url = "https://www.eprice.it/sitemap/https/Sitemap_Elettrodomestici_1.xml.gz", :
cannot open URL 'https://www.eprice.it/sitemap/https/Sitemap_Elettrodomestici_1.xml.gz'
In addition: Warning messages:
1: In fun(url = "https://www.eprice.it/sitemap/https/Sitemap_Elettrodomestici_1.xml.gz", :
downloaded length 0 != reported length 334
2: In fun(url = "https://www.eprice.it/sitemap/https/Sitemap_Elettrodomestici_1.xml.gz", :
cannot open URL 'https://www.eprice.it/sitemap/https/Sitemap_Elettrodomestici_1.xml.gz': HTTP status was '403 Forbidden'
If I just open the link in my browser, the download starts immediately.
What am I missing?
That site blocks requests for the URL you are trying to access when the User-Agent
value in the request headers is not that of a regular browser (Firefox, Chrome, ...). To make this work, you can change your user agent to that of a browser. Below is an example that works with utils::download.file(). A similar strategy might be available for polite.
# Set the user agent to a desktop Firefox string
options(HTTPUserAgent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/115.0")
download.file("https://www.eprice.it/sitemap/https/Sitemap_Elettrodomestici_1.xml.gz", "Sitemap_Elettrodomestici_1.xml.gz")
# Load XML from file
library(xml2)
read_xml("Sitemap_Elettrodomestici_1.xml.gz")
#> {xml_document}
#> <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
#> [1] <url>\n <loc>https://www.eprice.it/3%2De%2D4%2DPorte%2DHAIER%2DFrigorif ...
#> [2] <url>\n <loc>https://www.eprice.it/3%2De%2D4%2DPorte%2DHAIER%2DFrigorif ...
#> [3] <url>\n <loc>https://www.eprice.it/3%2De%2D4%2DPorte%2DHAIER%2DFrigorif ...
#> [4] <url>\n <loc>https://www.eprice.it/3%2De%2D4%2DPorte%2DHAIER%2DHaier%2D ...
#> [5] <url>\n <loc>https://www.eprice.it/3%2De%2D4%2DPorte%2DMIDEA%2DFrigorif ...
#> [6] <url>\n <loc>https://www.eprice.it/Accessori%2DFrigoriferi%2DELECTROLUX ...
#> [7] <url>\n <loc>https://www.eprice.it/accessori%2DIMPERIA/d%2D1597166</loc ...
#> [8] <url>\n <loc>https://www.eprice.it/accessori%2DIMPERIA/d%2D2489361</loc ...
#> [9] <url>\n <loc>https://www.eprice.it/accessori%2Dincasso%2DDe%20Longhi/d% ...
#> [10] <url>\n <loc>https://www.eprice.it/accessori%2Dincasso%2DELECTROLUX/d%2 ...
#> [11] <url>\n <loc>https://www.eprice.it/accessori%2Dincasso%2DELECTROLUX/d%2 ...
#> [12] <url>\n <loc>https://www.eprice.it/accessori%2Dincasso%2DELUX%20INC/d%2 ...
#> [13] <url>\n <loc>https://www.eprice.it/accessori%2DKENWOOD/d%2D5551714</loc ...
#> [14] <url>\n <loc>https://www.eprice.it/accessori%2DKENWOOD/d%2D7625838</loc ...
#> [15] <url>\n <loc>https://www.eprice.it/accessori%2DKitchenAid/d%2D50118434< ...
#> [16] <url>\n <loc>https://www.eprice.it/Accessori%2Dmacchine%2Dcaffe%2DBIA%2 ...
#> [17] <url>\n <loc>https://www.eprice.it/Accessori%2Dmacchine%2Dcaffe%2DBIA%2 ...
#> [18] <url>\n <loc>https://www.eprice.it/Accessori%2Dmacchine%2Dcaffe%2DBIA%2 ...
#> [19] <url>\n <loc>https://www.eprice.it/accessori%2Dmacchine%2Dcaffe%2DDE%20 ...
#> [20] <url>\n <loc>https://www.eprice.it/accessori%2Dmacchine%2Dcaffe%2DDE%20 ...
#> ...
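Note that options(HTTPUserAgent = ...) changes the user agent for the rest of the R session. If that is not wanted, one possible approach is to set the option only for the duration of a single download and restore the previous value afterwards. The sketch below uses base R only; the helper name download_with_user_agent and its download_fun argument are made up for illustration and are not part of polite or utils.

```r
# Hypothetical helper (not part of any package): set the HTTPUserAgent option
# for one download only, then restore whatever value was there before.
download_with_user_agent <- function(url, destfile, user_agent,
                                     download_fun = utils::download.file, ...) {
  # options() returns the previous values, so they can be put back on exit
  old <- options(HTTPUserAgent = user_agent)
  on.exit(options(old), add = TRUE)

  # mode = "wb" writes the file in binary mode, which a .gz archive needs
  download_fun(url, destfile, mode = "wb", ...)
}

# Example call (commented out because it needs network access):
# ua <- "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/115.0"
# download_with_user_agent(
#   "https://www.eprice.it/sitemap/https/Sitemap_Elettrodomestici_1.xml.gz",
#   "Sitemap_Elettrodomestici_1.xml.gz",
#   user_agent = ua
# )
```

The download_fun argument defaults to utils::download.file and is only there so the helper can be tried out with a stand-in function, without hitting the site. Whether the server keeps accepting a given browser user agent string is up to the site.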