Tags: r, web-scraping, doparallel, xml2

HTML pages not retained in list while using mclapply


When using plain lapply, the read_html page results are retained:

library(xml2)  

lapply(c("https://www.analyticsvidhya.com/blog/2018/06/datahack-radio-1-machine-learning-competitions-with-kaggle-ceo-anthony-goldbloom/","https://www.analyticsvidhya.com/blog/2018/09/datahack-radio-lyft-dr-alok-gupta/"), function(x){read_html(x)})
#> [[1]]
#> {xml_document}
#> <html>
#> [1] <head lang="en-US" prefix="og: http://ogp.me/ns#">\n<meta http-equiv ...
#> [2] <body class="post-template-default single single-post postid-45087 s ...
#> 
#> [[2]]
#> {xml_document}
#> <html>
#> [1] <head lang="en-US" prefix="og: http://ogp.me/ns#">\n<meta http-equiv ...
#> [2] <body class="post-template-default single single-post postid-46725 s ...

When using parallel::mclapply, however, the results are not retained:

library(xml2)
library(parallel)  

mclapply(c("https://www.analyticsvidhya.com/blog/2018/06/datahack-radio-1-machine-learning-competitions-with-kaggle-ceo-anthony-goldbloom/",
           "https://www.analyticsvidhya.com/blog/2018/09/datahack-radio-lyft-dr-alok-gupta/"),
         function(x) read_html(x), mc.cores = 2)
#> [[1]]
#> {xml_document}
#> 
#> [[2]]
#> {xml_document}

I can't figure out why this is happening; even with foreach I'm not able to get the same results as with plain lapply. Help!


Solution

  • Time to sew

    (I mean, you used the word thread so I'm not passing up the opportunity for a pun or three).

    Deep in the manual page for ?parallel::mclapply you'll eventually see that it works by:

    • forking processes
    • serializing results
    • eventually gathering up these serialized results and combining them into one object

    You can read ?serialize to see the method used.
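
    To see what that round trip does to an ordinary R object (the names here are purely illustrative), a minimal sketch:

    x <- list(a = 1:3, b = "hello")
    raw_bytes <- serialize(x, connection = NULL)  # what a forked child ships back to the parent
    identical(unserialize(raw_bytes), x)          # what the parent reconstructs
    ## [1] TRUE

    Plain vectors, lists and data frames survive this round trip just fine. The trouble starts with objects that are only thin handles to data living outside R's heap.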

    Why can't we serialize xml_document/html_document objects?

    First, let's make one:

    library(xml2)
    
    (doc <- read_html("<p>hi there!</p>"))
    ## {xml_document}
    ## <html>
    ## [1] <body><p>hi there!</p></body>
    

    and look at the structure:

    str(doc)
    ## List of 2
    ##  $ node:<externalptr> 
    ##  $ doc :<externalptr> 
    ##  - attr(*, "class")= chr [1:2] "xml_document" "xml_node"
    
    doc$node
    ## <pointer: 0x7ff45ab17ce0>
    

    Hrm. Those are <externalptr> objects. What does ?"externalptr-class" (eventually) say about them?

    …
    "externalptr" # raw external pointers for use in C code
    

    Since it's not a built-in object and the data is hidden away and only accessible via the package interface, R can't serialize it on its own and needs help. (That hex string — 0x7ff45ab17ce0 — is the memory pointer to where this opaque data is hidden).

    "You can't be serious…"

    Totally am.

    In the event you're from Missouri (the "Show Me" state), we can see what happens without the complexity of parallel ops and raw connection object serialization machinations by just trying to save the document above to an RDS file and read it back:

    tf <- tempfile(fileext = ".rds")
    saveRDS(doc, tf)
    
    str(doc2 <- readRDS(tf))
    ## List of 2
    ##  $ node:<externalptr> 
    ##  $ doc :<externalptr> 
    ##  - attr(*, "class")= chr [1:2] "xml_document" "xml_node"
    

    Now, you may be all like "AHA! See, it works!" Aaaaand…you'd be wrong:

    doc2$node
    ## <pointer: 0x0>
    

    The 0x0 means it's not pointing to anything. You've lost all that data. It's gone. Forever. (But, it had a good run so we should not be too sad about it.) This has been discussed by the xml2 devs and, rather than make life easier for us, they punted and made ?xml_serialize.
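
    If you try to actually use that resurrected document it just blows up (the exact wording can vary by xml2 version, but it will be something along these lines):

    xml_find_all(doc2, "//p")
    ## Error: external pointer is not valid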

    Wait…there's an xml_serialize but it's kinda not all that useful?

    Yep. And it gets even worse.

    Hopefully your curiosity was sufficiently piqued that you went ahead and found out what this quite seriously named xml_serialize() function does. If not, this is R, so to find out just type its name without the () to get:

    function (object, connection, ...) 
    {
        if (is.character(connection)) {
            connection <- file(connection, "w", raw = TRUE)
            on.exit(close(connection))
        }
        serialize(structure(as.character(object, ...), class = "xml_serialized_document"), 
            connection)
    }
    

    Apart from wiring up some connection bits, the complex sorcery behind this xml_serialize function is, well, just as.character(). (Kind of a let-down, actually.)
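
    For completeness, the round trip with that pair looks roughly like this (a sketch; doc is the tiny document from above and tf2 is just a throwaway temp file). It only "works" because xml_unserialize() re-parses the character HTML from scratch:

    tf2 <- tempfile(fileext = ".rds")
    xml_serialize(doc, tf2)
    doc3 <- xml_unserialize(tf2)
    xml_text(xml_find_first(doc3, "//p"))
    ## [1] "hi there!"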

    Since parallel ops perform (idiomatically) the equivalent of saveRDS() => readRDS(), when you return an xml_document, html_document (or their _node[s] siblings) from a parallel apply you eventually get back a whole pile of nothing.

    What can an innocent scraper (fine, content thief) do to overcome this devastating limitation?

    You are left with (at minimum) four choices:

    • 🤓 expand the complexity of your function in the parallel apply to process the XML/HTML document into a data frame, vector or list of objects that can all be serialized automagically by R so they can be combined for you
    • be cool 😎 and have one parallel apply that saves off the HTML into files (the HTTP ops are likely the slow bit anyway) and then a non-parallel operation that reads them sequentially and processes them, which it looks like you were going to do anyway (see the sketch after this list). Note that you're kind of being a leech and a really bad netizen if you don't do the HTML caching to file anyway, since skipping it shows you really don't care about the bandwidth and CPU costs of the content you're scraping (er, purloining).
    • don't be cool by doing ^^ 😔 and, instead, use as.character(read_html(…)) to return raw, serializable, character HTML directly from your parallel apply and then re-xml2 them back in the rest of your program (also sketched below)
    • 😱 fork the xml2 📦, layer in a proper serialization hack and don't bother PR'ing it since you'll likely spend a lot of time trying to convince them it's worth it and still end up failing, since this "externalptr serializing" stuff is tricksy business, fraught with peril, and you likely missed some edge cases (i.e. Hadley/Jim/etc know what they're doing and if they punted, it's probably something not worth doing).
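
    To make the second and third options concrete, here's a rough sketch (the URLs and the cache location are just placeholders):

    library(xml2)
    library(parallel)

    urls <- c("https://example.com/page-1", "https://example.com/page-2")

    ## option 2: cache the raw pages to disk in the parallel step (the HTTP
    ## fetch is the slow bit anyway), then parse the files sequentially
    cache_dir <- tempdir()
    paths <- mclapply(seq_along(urls), function(i) {
      path <- file.path(cache_dir, sprintf("page-%02d.html", i))
      writeLines(as.character(read_html(urls[i])), path)
      path
    }, mc.cores = 2)
    docs <- lapply(paths, read_html)

    ## option 3: hand back plain character HTML (which serializes just fine),
    ## then re-parse it after the parallel step
    html_txt <- mclapply(urls, function(u) as.character(read_html(u)), mc.cores = 2)
    docs <- lapply(html_txt, read_html)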

    In reality, rather than use xml2::read_html() to grab the content, I'd use httr::GET() + httr::content(…, as = "text") instead (if you're being cool and caching the pages vs callously wasting other folks' resources). read_html() uses libxml2 under the covers and transforms the document (even if sometimes just a little), and it's better to have untransformed, raw, cached source data vs something mangled by software that thinks it's smarter than we are.
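
    A sketch of what that might look like (fetch_and_cache() is just an illustrative name, not anything canonical):

    library(httr)

    fetch_and_cache <- function(url, path) {
      res <- GET(url)
      stop_for_status(res)                                   # fail loudly on HTTP errors
      writeLines(content(res, as = "text", encoding = "UTF-8"), path)
      path                                                   # where the untouched HTML now lives
    }

    ## later, and not necessarily in parallel:
    ## doc <- read_html(path)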

    FIN

    There really isn't any more I can do to clarify this than the above verbose-mode blathering. Hopefully this expansion helps others grok what's going on as well.