Bind a character vector to a list and convert the result into a data frame


I have a list of URLs and have extracted the content as follows:

library(httr)
library(stringr)  # provides str_extract_all()
link="http://www.workerspower.net/disposable-workers-the-real-price-of-sweat-shop-labor"
get.link=GET(link)
get.content=content(get.link,as="text")
extract.content=str_extract_all(get.content,"<p>(.*?)</p>")

This returns a list of length 1 whose single element is a character vector of paragraphs. The number of paragraphs varies with the URL. I would like to bind the URL (link) to the content (extract.content), turn the result into a data frame, and then import that into a Corpus. My attempts fail; for example, this does not work because the rows have different lengths:

all=data.frame(url.vec=c(link1,link2),text.vec=c(extract.content1,extract.content2))

Does anyone know how to combine a character vector with a list of character vectors?


Solution

  • I would do this using the XML package. You should avoid using regular expressions on HTML/XML documents; use XPath instead. Here is a small function that, given a link, creates the corpus.

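    library(XML)
    library(tm)
    ## parse the page at `link` and build a corpus with one document per <p> element
    create.corpus <- function(link){
      doc <- htmlParse(link)
      parag <- xpathSApply(doc,'//p',xmlValue)
      cc <- Corpus(VectorSource(parag))
      ## store the source URL as corpus-level metadata
      meta(cc,tag='link',type='corpus') <- link
      cc
    }
    ## call it
    cc <- create.corpus(link)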
    

    Inspecting the result:

    meta(cc,type='corpus')
    # $create_date
    # [1] "2014-01-03 17:40:50 GMT"
    # 
    # $creator
    # [1] ""
    # 
    # $link
    # [1] "http://www.workerspower.net/disposable-workers-the-real-price-of-sweat-shop-labor"
    
    cc
    # A corpus with 36 text documents
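
If you still want the data frame from the question, one option is a long format with one row per paragraph, repeating each URL as many times as it has paragraphs. Below is a minimal sketch in base R, assuming the paragraphs for each URL have already been extracted into a list. The names links, paragraphs and all.df and the example.com URLs are made-up placeholders, not real scraped data.

```r
# Hypothetical stand-ins for the scraped results: one character vector of
# paragraphs per URL, each with a different number of paragraphs.
links <- c("http://example.com/a", "http://example.com/b")
paragraphs <- list(
  c("First paragraph of a.", "Second paragraph of a."),
  c("Only paragraph of b.")
)

# Repeat each URL once per paragraph so both columns end up the same length.
all.df <- data.frame(
  url  = rep(links, times = lengths(paragraphs)),
  text = unlist(paragraphs),
  stringsAsFactors = FALSE
)

print(all.df)
```

This avoids the row-length error because rep() stretches the URL vector to match the flattened text. The cost is that each URL appears on several rows.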