With a plain lapply, the read_html page results are retained:
library(xml2)
lapply(
  c(
    "https://www.analyticsvidhya.com/blog/2018/06/datahack-radio-1-machine-learning-competitions-with-kaggle-ceo-anthony-goldbloom/",
    "https://www.analyticsvidhya.com/blog/2018/09/datahack-radio-lyft-dr-alok-gupta/"
  ),
  function(x) read_html(x)
)
#> [[1]]
#> {xml_document}
#> <html>
#> [1] <head lang="en-US" prefix="og: http://ogp.me/ns#">\n<meta http-equiv ...
#> [2] <body class="post-template-default single single-post postid-45087 s ...
#>
#> [[2]]
#> {xml_document}
#> <html>
#> [1] <head lang="en-US" prefix="og: http://ogp.me/ns#">\n<meta http-equiv ...
#> [2] <body class="post-template-default single single-post postid-46725 s ...
With parallel's mclapply, they are lost:
library(xml2)
library(parallel)
mclapply(
  c(
    "https://www.analyticsvidhya.com/blog/2018/06/datahack-radio-1-machine-learning-competitions-with-kaggle-ceo-anthony-goldbloom/",
    "https://www.analyticsvidhya.com/blog/2018/09/datahack-radio-lyft-dr-alok-gupta/"
  ),
  function(x) read_html(x),
  mc.cores = 2
)
#> [[1]]
#> {xml_document}
#>
#> [[2]]
#> {xml_document}
I can't figure out why this is happening. Even with foreach I can't get the same results as with a normal lapply. Help!
(I mean, you used the word thread so I'm not passing up the opportunity for a pun or three).
Deep in the manual page for ?parallel::mclapply you'll eventually see that it works by forking worker processes and sending each result back to the parent process in serialized form. You can read ?serialize to see the method used.
Why can't R serialize xml_document/html_document objects?
First, let's make one:
library(xml2)
(doc <- read_html("<p>hi there!</p>"))
## {xml_document}
## <html>
## [1] <body><p>hi there!</p></body>
and look at the structure:
str(doc)
## List of 2
## $ node:<externalptr>
## $ doc :<externalptr>
## - attr(*, "class")= chr [1:2] "xml_document" "xml_node"
doc$node
## <pointer: 0x7ff45ab17ce0>
Hrm. Those are <externalptr> objects. What does ?"externalptr-class" (eventually) say about them?
…
"externalptr" # raw external pointers for use in C code
Since the actual document isn't a plain R object, and the data is hidden away in memory owned by C code and only accessible via the package interface, R can't serialize it on its own and needs help. (That hex string, 0x7ff45ab17ce0, is the memory address where this opaque data is hidden.)
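If you'd rather check than take my word for it, here's a minimal sketch (field_types is just an illustrative name, nothing from xml2) that strips the S3 class off the document and asks R what each field really is. It assumes the two-field layout shown by str() above.

```r
library(xml2)

doc <- read_html("<p>hi there!</p>")

# Drop the S3 class so we are looking at the bare list,
# then ask R for the underlying type of each field.
field_types <- vapply(unclass(doc), typeof, character(1))
print(field_types)
```

Both fields should report themselves as external pointers, i.e. there is no actual HTML stored on the R side at all.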
Totally am.
In the event you're from Missouri (the "Show Me" state), we can see what happens without the complexity of parallel ops and raw connection object serialization machinations by just trying to save the document above to an RDS file and read it back:
tf <- tempfile(fileext = ".rds")
saveRDS(doc, tf)
str(doc2 <- readRDS(tf))
## List of 2
## $ node:<externalptr>
## $ doc :<externalptr>
## - attr(*, "class")= chr [1:2] "xml_document" "xml_node"
Now, you may be all like "AHA! See, it works!" Aaaaand…you'd be wrong:
doc2$node
## <pointer: 0x0>
The 0x0 means it's not pointing to anything. You've lost all that data. It's gone. Forever. (But it had a good run, so we should not be too sad about it.)
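To see that the revived object really is unusable, and not just printing oddly, here's a small self-contained sketch. It assumes xml2 refuses to operate on a dead pointer and raises an R error (that's the behaviour I'd expect, but treat it as an assumption), so the call is wrapped in tryCatch() and the result is only checked for being an error.

```r
library(xml2)

doc <- read_html("<p>hi there!</p>")

# Same round trip as above: write to disk, read back.
tf <- tempfile(fileext = ".rds")
saveRDS(doc, tf)
doc2 <- readRDS(tf)
unlink(tf)

# Try to turn the revived document back into text. Capture the failure
# instead of letting it stop the script.
result <- tryCatch(as.character(doc2), error = function(e) e)

# TRUE if xml2 complained about the pointer (the assumption stated above).
print(inherits(result, "error"))
```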
This has been discussed by the xml2 devs and, rather than make life easier for us, they punted and made ?xml_serialize.
So there's an xml_serialize, but it's kinda not all that useful?
Yep. And it gets even worse.
Hopefully your curiosity was sufficiently piqued that you went ahead and found out what this quite seriously named xml_serialize() function does. If not, this is R, so to find out just type its name without the () to get:
function (object, connection, ...)
{
    if (is.character(connection)) {
        connection <- file(connection, "w", raw = TRUE)
        on.exit(close(connection))
    }
    serialize(structure(as.character(object, ...), class = "xml_serialized_document"),
        connection)
}
Apart from wiring up some connection bits, the complex sorcery behind this xml_serialize function is, well, just as.character(). (Kind of a let-down, actually.)
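Since the trick is just as.character(), you can do the same thing by hand. The sketch below is a minimal illustration, not xml2's own machinery: the variable names are made up for the example, and the serialize()/unserialize() pair merely stands in for whatever is moving the data around.

```r
library(xml2)

doc <- read_html("<p>hi there!</p>")

# Plain character data, which base R can serialize without any help.
html_text <- as.character(doc)

# An in-memory serialization round trip, standing in for the transfer step.
revived_text <- unserialize(serialize(html_text, connection = NULL))

# Re-parse the text to get a live document (with fresh pointers) again.
doc_again <- read_html(revived_text)
p_text <- xml_text(xml_find_first(doc_again, "//p"))
print(p_text)
```

The text survives the trip untouched, and re-parsing it gives you back something you can actually query.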
Since parallel ops perform (idiomatically) the equivalent of saveRDS() => readRDS(), when you return an xml_document, html_document (or their _node[s] siblings) in a parallel apply you eventually get back a whole pile of nothing.
You are left with (at minimum) four choices, including:
- use as.character(read_html(…)) to return raw, serializable, character HTML directly from your parallel apply and then re-xml2 them back in the rest of your program
- fork the xml2 📦, layer in a proper serialization hack and don't bother PR'ing it, since you'll likely spend a lot of time trying to convince them it's worth it and still end up failing: this "externalptr serializing" is tricksy business, fraught with peril, and you likely missed some edge cases (i.e. Hadley/Jim/etc. know what they're doing and if they punted, it's probably something not worth doing).
In reality, rather than use xml2::read_html() to grab the content, I'd use httr::GET() + httr::content(…, as="text") instead (if you're being cool and caching the pages vs callously wasting other folks' resources), since read_html() uses libxml2 under the covers and transforms the document (even if sometimes just a little), and it's better to have untransformed raw, cached source data vs something mangled by software that thinks it's smarter than we are.
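Here's a rough sketch of the as.character() route with mclapply. To keep it runnable without hitting anyone's server it parses two inline HTML strings instead of the URLs from the question; pages, n_cores, html_strings and p_texts are names made up for the example, and the Windows fallback to one core is my own addition since mclapply can't use more than one core there.

```r
library(xml2)
library(parallel)

# Stand-ins for the real pages (hypothetical inline HTML, not the question's URLs).
pages <- c("<p>first page</p>", "<p>second page</p>")

# mclapply relies on forking, so fall back to a single core on Windows.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# In the workers: parse, then return plain character HTML, which serializes fine.
# For real URLs, the httr::GET() + httr::content(…, as = "text") combination
# mentioned above could supply the text instead.
html_strings <- mclapply(
  pages,
  function(x) as.character(read_html(x)),
  mc.cores = n_cores
)

# Back in the parent process: re-parse to get live documents again.
docs <- lapply(html_strings, read_html)
p_texts <- vapply(
  docs,
  function(d) xml_text(xml_find_first(d, "//p")),
  character(1)
)
print(p_texts)
```

The parsing does happen twice this way, which is the price of shipping text rather than pointers between processes.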
There really isn't any more I can do to clarify this than the above, verbose-mode blathering. Hopefully this expansion also helps others grok what's going on as well.