I am trying to remove escape characters (such as \r, \n and \t) and numbers from a webpage that I read with the readLines function. I am using strsplit and unlist for some of this, but I'm not sure how to remove the numbers. I was thinking of using the tm package, but I seem to be missing a format conversion. How can I transform my webpage so that tm can remove numbers etc., or is there an easier way of removing the clutter from the text? I hope to concatenate a number of webpages, so there will be quite a bit of cleaning.
library(rvest)
library(tm)
webpage <- readLines("https://www.sciencedaily.com/releases/2020/02/200219113746.htm",
                     encoding = "UCS-2LE")
dirtytext <- unlist(strsplit(webpage, "\\r|\\n|\\t"))
cleantext <- tm_map(dirtytext, removeNumbers)
The last line gives the error message:
'Error in UseMethod("tm_map", x) : no applicable method for 'tm_map' applied to an object of class "character"'
I'm not sure whether you want to include the lede, but the following returns the story paragraph by paragraph and leaves out the non-story elements on the page, such as advertising.
library(rvest)
url <- "https://www.sciencedaily.com/releases/2020/02/200219113746.htm"
page <- read_html(url)
story <- page %>%
  html_nodes("div#text p") %>% # use "div#story_text p" to include lede
  html_text()
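
To remove the numbers afterwards, a plain gsub call is probably enough. The sketch below is one way to do it, not the only one. It uses a made-up two-element vector as a stand-in for story so that it runs without fetching the page, and clean_text is a hypothetical helper name. The second half shows the tm route: the error in the question arises because tm_map expects a corpus rather than a character vector, so the text is wrapped with VCorpus(VectorSource(...)) first. That part only runs if tm is installed.

```r
# Stand-in for the `story` vector returned by html_text(); the sample text is invented.
story <- c("Published 19 February 2020:\r\n\tresearchers tracked 42 samples.",
           "Results improved by 7 percent   over 12 months.")

# Base R: drop digits, then collapse runs of whitespace (including \r, \n, \t).
# `clean_text` is a hypothetical helper name, not part of rvest or tm.
clean_text <- function(x) {
  x <- gsub("[[:digit:]]+", "", x)   # remove numbers
  x <- gsub("[[:space:]]+", " ", x)  # collapse whitespace and control characters
  trimws(x)
}

cleantext <- clean_text(story)

# tm route: tm_map() needs a corpus, so wrap the character vector first.
if (requireNamespace("tm", quietly = TRUE)) {
  corpus <- tm::VCorpus(tm::VectorSource(story))
  corpus <- tm::tm_map(corpus, tm::removeNumbers)
  corpus <- tm::tm_map(corpus, tm::stripWhitespace)
  tm_text <- unlist(lapply(corpus, as.character))  # back to a character vector
}
```

Because clean_text takes and returns a character vector, the same call should work on text collected from several pages once they have been combined into one vector.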