I am relatively new to web scraping, and I am interested in scraping textual data from an online social forum. I can successfully scrape the post text, but I am unable to organize it or extract the other details (username, date, user status) associated with each post.
Currently, my code is as follows:
library(tidyverse)
library(rvest)

# scrape the post text from all 32 pages of the thread
pages <- 1:32
hardwarezone_list <- list()
for (i in seq_along(pages)) {
  hardwarezone_link <- paste0(
    "https://forums.hardwarezone.com.sg/threads/glgt-you-can-see-that-f-b-jobs-are-really-not-on-top-of-the-minds-of-singaporeans.6404486/",
    "page-", i
  )
  hardwarezone_page <- read_html(hardwarezone_link)
  hardwarezone_list[[i]] <- hardwarezone_page %>%
    html_nodes(".bbWrapper") %>%
    html_text()
}

# note: rbind() recycles shorter vectors, so any page with fewer posts than
# the longest page gets padded with repeated posts
hardwarezone_table <- do.call(rbind, hardwarezone_list)
hardwarezone_table <- as.data.frame(hardwarezone_table)

# print a data example
dput(hardwarezone_table[1:2, c(1, 2)])
# output:
structure(list(V1 = c(" https://www.channelnewsasia.com/ne...bs-restaurant-association-13441340?cid=FBcna \n\"You can see that F&B jobs are really not on top of the minds of Singaporeans even when there's high unemployment,\" says a business owner.",
"I guesss majority prefer to either send food or eat food .. not prepare the food. Haha"
), V2 = c("Recession and retrenchment only happen in EDMW ",
"\n\t\n\t\t\n\t\t\t\n\t\t\t\ttokong said:\n\t\t\t\n\t\t\n\t\n\t\n\t\t\n\t\t\n\t\t\tno thanks, those people whose pop and mom are hawkers or have been hawkers will know. \nour parents will discourage us to become hawkers. better study hard and get a job.\nf and b jobs generate no values to your cv unless it is the end of the road for you.\nf and b pay is very jialat also. if the salary cannot feed your own family, why take the job?\nthose young punks who go into f and b either has the passion or enjoys the freedom of being not an employee\n\t\t\n\t\tClick to expand...\n\t\n\nyou will be shocked how much hawkers earn. even just those drink stall make kopi, teh kind and get soft drinks, ice from supplier and sell. don't mention bubble tea that one is considered quite artisanal.\nf&b has many positions, les amis executive chef also f&b, waitress also f&b, george quek also f&b. the value of CV is dependent on how a person wanna craft his career path, and not the industry."
)), row.names = 1:2, class = "data.frame")
However, ideally each row/observation would contain the following fields, rather than just the post text as in my code above:
username       post                   date       user status
tegridy_farm   why is that the case.  3/10/2022  banned
Mackey         why                    3/10/2022  Senior member
eric cartman   kyle is bad            3/10/2022  banned
Update: I implemented the nice solution from the answer below, and it worked perfectly.
Here is how I would do it:
# load the required packages (pacman installs any that are missing, then loads them)
pacman::p_load(tidyverse, rvest, glue)

hardwarezone_scraper <- function(page_number) {
  # our base URL, with a glue placeholder for the page number
  hardwarezone_link <- "https://forums.hardwarezone.com.sg/threads/glgt-you-can-see-that-f-b-jobs-are-really-not-on-top-of-the-minds-of-singaporeans.6404486/page-{page_number}"
  # read the html and select the container for each post
  messages <- read_html(glue::glue(hardwarezone_link)) %>%
    html_nodes(".message-inner")
  # get the information we want
  usernames <- messages %>%
    html_nodes(".message-name") %>%
    html_text()
  user_status <- messages %>%
    html_nodes(".message-userTitle") %>%
    html_text()
  post_date <- messages %>%
    html_nodes(".listInline") %>%
    html_nodes(".u-dt") %>%
    html_text() %>%
    # example is "Nov 4, 2020"
    parse_date(format = "%b %d, %Y")
  post <- messages %>%
    html_nodes(".bbWrapper") %>%
    html_text()
  # combine into a dataframe and return
  tibble(
    username = usernames,
    post = post,
    date = post_date,
    `user status` = user_status
  )
}

hardwarezone_scraper(1)
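To collect the whole thread, you can map this function over every page and row-bind the results. A minimal sketch, assuming the thread still has the 32 pages from your original loop (the one-second pause between requests is just an assumption about being polite to the server, not something the function requires):

# scrape all 32 pages and stack the results into one tibble
all_posts <- purrr::map_dfr(1:32, function(page) {
  Sys.sleep(1)  # small delay between requests, to avoid hammering the server
  hardwarezone_scraper(page)
})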
The trick for scraping (if there is a trick) is to use the Chrome (or Firefox) inspector to look at the elements of the webpage and find the identifiers for the elements you want. Often similar things share the same class, as was the case here.
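Once the inspector shows you a candidate class, it is worth checking it from R before wiring it into the scraper. A quick sketch, reusing the thread's first page and the .message-name selector from above:

# verify that a selector found in the inspector matches what we expect
page_1 <- read_html("https://forums.hardwarezone.com.sg/threads/glgt-you-can-see-that-f-b-jobs-are-really-not-on-top-of-the-minds-of-singaporeans.6404486/page-1")
page_1 %>%
  html_nodes(".message-name") %>%
  html_text() %>%
  head()

If the selector is right, this prints the first few usernames on the page; if it prints character(0), the class name needs another look in the inspector.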