
Parsing meta name/content using XML and R


Regarding the answer to: how to get information within <meta name...> tag in html using htmlParse and xpathSApply

My issue:

html <- htmlParse(domain, useInternalNodes=T);
names <- html['//meta/@name']
content <- html['//meta/@content']

cbind(names, content)

The meta tags in the page are:

<meta name="description" content="blah, blah...." />
<meta name="keywords" content="keyword1, keyword2" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="google-site-verification" content="1234jalsdkfjasdf928374-293423" />

What I find is this:

 length(names)
[1] 3

length(content)
[1] 4

names                                     content
[1, ] "description"                       [1, ] "blah, blah...."
[2, ] "keywords"                          [2, ] "keyword1, keyword2"
[3, ] "google-site-verification"          [3, ] "text/html; charset=UTF-8"
[4, ] "description"                       [4, ] "1234jalsdkfjasdf928374-293423"

It seems the parser is tripping up on "http-equiv": it skips that tag's name and returns the next one, "google-site-verification", while still returning the content for the "http-equiv" tag. Since there are then no more names, cbind wraps around to "description" again to pair with the last content value, which is the actual "google-site-verification" content. This seems like it should be a simple fix, but so far no conditional I have tried works. How can I make this right?


Solution

  • I realize you figured out what you needed to (which doesn't really match the original question), but we'll take StackOverflow.com as an example since I had it coded up anyway as an addition to my original answer:

    library(XML)
    
    doc <- htmlParse("http://stackoverflow.com/", useInternalNodes=TRUE)
    

    that has the following <meta> tags:

    <meta name="twitter:card" content="summary">
    <meta name="twitter:domain" content="stackoverflow.com"/>
    <meta property="og:type" content="website" />
    <meta property="og:image" itemprop="image primaryImageOfPage" content="http://cdn.sstatic.net/stackoverflow/img/[email protected]?v=fde65a5a78c6" />
    <meta name="twitter:title" property="og:title" itemprop="title name" content="Stack Overflow" />
    <meta name="twitter:description" property="og:description" itemprop="description" content="Q&amp;A for professional and enthusiast programmers" />
    <meta property="og:url" content="http://stackoverflow.com/"/>
    

    Not every tag has a name attribute; in fact, of the 7, only 4 do:

    length(doc["//meta/@name"])
    ## [1] 4
    

    Notice that's the same as doing:

    length(xpathSApply(doc, "//meta/@name"))
    ## [1] 4
    

    which is pretty much what's happening under the covers.

    An attribute query only returns the nodes that actually have the attribute, so <meta> tags without a name contribute nothing. You can see it laid out more clearly if you do:

    xpathSApply(doc, "//meta", xmlGetAttr, "name")
    
    ## [[1]]
    ## [1] "twitter:card"
    ## 
    ## [[2]]
    ## [1] "twitter:domain"
    ## 
    ## [[3]]
    ## NULL
    ## 
    ## [[4]]
    ## NULL
    ## 
    ## [[5]]
    ## [1] "twitter:title"
    ## 
    ## [[6]]
    ## [1] "twitter:description"
    ## 
    ## [[7]]
    ## NULL
    

    that list, when converted to a vector, truncates to 4 entries because the NULLs are dropped. rvest (original answer) is just "smarter" when it comes to the extractions.
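
    If you would rather stay with the XML package, one way to keep the two vectors aligned is to give xmlGetAttr a default, so every <meta> node yields exactly one value. A minimal sketch, assuming the four tags from the question are pasted in as a string (html_text, meta_names and meta_content are names made up for this illustration):

    ```r
    library(XML)

    # The question's four <meta> tags as a literal string (assumed input)
    html_text <- '<html><head>
    <meta name="description" content="blah, blah...." />
    <meta name="keywords" content="keyword1, keyword2" />
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    <meta name="google-site-verification" content="1234jalsdkfjasdf928374-293423" />
    </head><body></body></html>'

    doc <- htmlParse(html_text, asText = TRUE)

    # Visit every <meta> node; a missing attribute becomes NA instead of NULL,
    # so both results keep one entry per tag
    meta_names   <- xpathSApply(doc, "//meta", xmlGetAttr, "name",    default = NA_character_)
    meta_content <- xpathSApply(doc, "//meta", xmlGetAttr, "content", default = NA_character_)

    cbind(name = meta_names, content = meta_content)
    ```

    The http-equiv tag now shows up as a row with NA in the name column rather than vanishing, so cbind has nothing to recycle.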

    ORIGINAL ANSWER

    Working with rvest, you can grab all the <meta> attributes into a data frame pretty quickly (if that's what you're trying to do):

    library(rvest)
    library(dplyr)
    
    pg <- read_html("http://facebook.com/")  # html() is deprecated in newer rvest
    
    all_meta_attrs <- unique(unlist(lapply(lapply(pg %>% html_nodes("meta"), html_attrs), names)))
    
    dat <- data.frame(lapply(all_meta_attrs, function(x) {
      pg %>% html_nodes("meta") %>% html_attr(x)
    }))
    
    colnames(dat) <- all_meta_attrs
    
    glimpse(dat)
    
    ## Observations: 19
    ## Variables:
    ## $ charset    (fctr) utf-8, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
    ## $ http-equiv (fctr) NA, refresh, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
    ## $ content    (fctr) NA, 0; URL=/?_fb_noscript=1, default, Facebook, h...
    ## $ name       (fctr) NA, NA, referrer, NA, NA, NA, NA, NA, NA, NA, NA,...
    ## $ id         (fctr) NA, NA, meta_referrer, NA, NA, NA, NA, NA, NA, NA...
    ## $ property   (fctr) NA, NA, NA, og:site_name, og:url, og:image, og:lo...
    

    but it will also reliably extract the attributes for you:

    pg %>% html_nodes("meta") %>% html_attr("http-equiv")
    
    ##  [1] NA                "refresh"         NA               
    ##  [4] NA                NA                NA               
    ##  [7] NA                NA                NA               
    ## [10] NA                NA                NA               
    ## [13] NA                NA                NA               
    ## [16] NA                NA                NA               
    ## [19] "X-Frame-Options"
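
    Applied to the name/content pairing from the question, the same idea might look like the sketch below. It assumes the question's four tags are supplied as a literal string; metas, dat and named are illustrative names, not anything from the original code:

    ```r
    library(rvest)

    # The question's four <meta> tags as a literal string (assumed input)
    pg <- read_html('<html><head>
    <meta name="description" content="blah, blah...." />
    <meta name="keywords" content="keyword1, keyword2" />
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    <meta name="google-site-verification" content="1234jalsdkfjasdf928374-293423" />
    </head><body></body></html>')

    metas <- html_nodes(pg, "meta")

    # html_attr() gives NA where the attribute is absent,
    # so both columns have one row per <meta> tag
    dat <- data.frame(
      name    = html_attr(metas, "name"),
      content = html_attr(metas, "content"),
      stringsAsFactors = FALSE
    )

    # Keep only the tags that actually carry a name attribute
    named <- dat[!is.na(dat$name), ]
    named
    ```

    Dropping the NA rows after building the data frame, rather than before, is what keeps each name matched with its own content.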