Tags: r, url, nlp, tm, readlines

Cleaning web text using readLines and the tm-package in R


I am trying to strip regex-style escape codes (\r, \n, \t) and numbers from a webpage that I read with the readLines function. I split the text with strsplit and unlist, but I'm not sure how to remove the numbers. I was thinking of using the tm package, but I seem to be missing a format conversion. How can I convert the page into something tm can process so that I can remove numbers and so on, or is there an easier way to strip the clutter from the text? I eventually want to read and concatenate a number of webpages, so there will be quite a bit of cleaning.

 library(rvest)
 library(tm)
 webpage <- readLines("https://www.sciencedaily.com/releases/2020/02/200219113746.htm", 
             encoding = "UCS-2LE")
 dirtytext <- unlist(strsplit(webpage,"\\r|\\n|\\t"))
 cleantext <- tm_map(dirtytext,removeNumbers)

The last line gives the error message:

'Error in UseMethod("tm_map", x) : no applicable method for 'tm_map' applied to an object of class "character"'


Solution

  • I'm not sure whether you want to include the lede, but the following returns the story paragraph by paragraph, which leaves out the non-story elements on the page, such as advertising.

    library(rvest)
    
    url <- "https://www.sciencedaily.com/releases/2020/02/200219113746.htm"
    
    page <- read_html(url)
    
    story <- page %>%
      html_nodes("div#text p") %>%  # use "div#story_text p" to include lede
      html_text()                   # one character string per paragraph
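On the original error: tm_map() is written for corpus objects, not for plain character vectors, which is why the call on dirtytext fails. Below is a minimal sketch of two ways around that, not a definitive recipe. The sample story vector and the helper name clean_paragraphs are assumptions made for illustration so that the example runs without fetching the page. The first route uses only base R; the second wraps the text in a corpus before calling tm_map(), and is skipped if tm is not installed.

```r
# Hypothetical sample standing in for `story`; the real vector
# would come from html_text() in the code above.
story <- c("In 2020, researchers\ttested 35 samples.\r\n",
           "  Accuracy rose by 12 percent.  ")

# Hypothetical helper: remove digits and tidy whitespace with base R only.
clean_paragraphs <- function(x) {
  x <- gsub("[[:digit:]]+", "", x)   # drop runs of digits
  x <- gsub("[[:space:]]+", " ", x)  # collapse tabs, newlines, repeated spaces
  trimws(x)                          # strip leading and trailing blanks
}

cleantext <- clean_paragraphs(story)

# Optional tm route: build a corpus first, then apply the transformations.
if (requireNamespace("tm", quietly = TRUE)) {
  corpus <- tm::VCorpus(tm::VectorSource(story))   # one document per paragraph
  corpus <- tm::tm_map(corpus, tm::removeNumbers)
  corpus <- tm::tm_map(corpus, tm::stripWhitespace)
  # Convert the documents back to a plain character vector.
  tm_text <- trimws(unlist(lapply(corpus, as.character), use.names = FALSE))
}
```

For a handful of pages the base R route is probably enough. The corpus route may pay off if you later want other tm steps, such as stop-word removal or a document-term matrix, on the combined text.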