r, url, web-scraping, xml2

Error in open.connection(x, "rb") : HTTP error 404, with the read_html function


I got the following error when using the read_html function from the xml2 package:

Error in open.connection(x, "rb") : HTTP error 404.

Here is the URL I attempted to read:

xml2::read_html("https://www.act.is/media-centre/press-releases/actis-energy-platform-zuma-energía-reaches-financial-close-on-two-further-solar-farms-in-mexico/")

By contrast, no error was generated when reading this URL:

xml2::read_html("https://www.act.is/media-centre/press-releases/actis-wins-cio-magazine-s-real-asset-award/")

The first URL contains a word with an accented character ("energía"); the second URL does not. Is it possible to read URLs that contain accented characters?


Solution

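  • The URL contains a non-ASCII character (the "í" in "energía"), and it has to be percent-encoded before the request is sent. In Python, HTTP libraries provide helpers for this; in R, utils::URLencode() does the same job for a UTF-8 string.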

    Python example:

    import requests

    base_url = "https://www.act.is/media-centre/press-releases/"
    # Percent-encode the non-ASCII "í" as its UTF-8 bytes (%C3%AD); "/" is left as-is by default
    encoded_url = requests.utils.quote("actis-energy-platform-zuma-energía-reaches-financial-close-on-two-further-solar-farms-in-mexico/")
    response = requests.get(base_url + encoded_url)
    

    Encoded URL:

    https://www.act.is/media-centre/press-releases/actis-energy-platform-zuma-energ%C3%ADa-reaches-financial-close-on-two-further-solar-farms-in-mexico/
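
If the full URL is already in hand as a single string, the same encoding can be applied to just its path using only the standard library. The following is a minimal sketch, not part of the original answer: `encode_url_path` is a hypothetical helper name, and treating "%" as safe is an assumption so that already-encoded URLs are not encoded twice (a literal "%" in a path would therefore be left untouched).

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_url_path(url: str) -> str:
    """Percent-encode non-ASCII characters in the path of a URL.

    Hypothetical helper for illustration; scheme, host, query and
    fragment are passed through unchanged.
    """
    parts = urlsplit(url)
    # Keep "/" as the path separator and leave existing "%XX" escapes alone.
    encoded_path = quote(parts.path, safe="/%")
    return urlunsplit(
        (parts.scheme, parts.netloc, encoded_path, parts.query, parts.fragment)
    )


# "\u00ed" is the accented "í" in "energía".
raw_url = (
    "https://www.act.is/media-centre/press-releases/"
    "actis-energy-platform-zuma-energ\u00eda-reaches-financial-close-"
    "on-two-further-solar-farms-in-mexico/"
)

encoded = encode_url_path(raw_url)
print(encoded)
```

The printed URL should match the encoded URL shown above, and a URL that contains only ASCII characters passes through unchanged.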