Tags: r, web-scraping, dplyr, rvest

Scraping specific details/columns in textual data (rvest)


I am relatively new to web scraping and want to scrape textual data from an online forum. I can scrape the post text successfully, but I cannot extract and organize specific details (such as the username or post date) alongside it.

Currently, my code is as follows:

library(tidyverse)
library(rvest)


# Scrape posts
pages <- 1:32

hardwarezone_list <- list()

for (i in seq_along(pages)) {
  hardwarezone_link <- paste0("https://forums.hardwarezone.com.sg/threads/glgt-you-can-see-that-f-b-jobs-are-really-not-on-top-of-the-minds-of-singaporeans.6404486/", "page-", i)
  hardwarezone_page <- read_html(hardwarezone_link)
  hardwarezone_list[[i]] <- hardwarezone_page %>%
    html_nodes(".bbWrapper") %>%
    html_text()
}

hardwarezone_table <- do.call(rbind, hardwarezone_list)
hardwarezone_table <- as.data.frame(hardwarezone_table)

# Print a sample of the data
dput(hardwarezone_table[1:2, c(1, 2)])

# output:
structure(list(V1 = c(" https://www.channelnewsasia.com/ne...bs-restaurant-association-13441340?cid=FBcna \n\"You can see that F&B jobs are really not on top of the minds of Singaporeans even when there's high unemployment,\" says a business owner.", 
"I guesss majority prefer to either send food or eat food .. not prepare the food. Haha"
), V2 = c("Recession and retrenchment only happen in EDMW ", 
"\n\t\n\t\t\n\t\t\t\n\t\t\t\ttokong said:\n\t\t\t\n\t\t\n\t\n\t\n\t\t\n\t\t\n\t\t\tno thanks, those people whose pop and mom are hawkers or have been hawkers will know. \nour parents will discourage us to become hawkers. better study hard and get a job.\nf and b jobs generate no values to your cv unless it is the end of the road for you.\nf and b pay is very jialat also. if the salary cannot feed your own family, why take the job?\nthose young punks who go into f and b either has the passion or enjoys the freedom of being not an employee\n\t\t\n\t\tClick to expand...\n\t\n\nyou will be shocked how much hawkers earn. even just those drink stall make kopi, teh kind and get soft drinks, ice from supplier and sell. don't mention bubble tea that one is considered quite artisanal.\nf&b has many positions, les amis executive chef also f&b, waitress also f&b, george quek also f&b. the value of CV is dependent on how a person wanna craft his career path, and not the industry."
)), row.names = 1:2, class = "data.frame")

Ideally, however, each row/observation would contain the following fields, rather than only the post text that my code above collects:

username        post                     date         user status
tegridy_farm    why is that the case.    3/10/2022    banned
Mackey          why                      3/10/2022    Senior member
eric cartman    kyle is bad              3/10/2022    banned

Update: I implemented the solution below, and it worked perfectly.

Solution

  • Here is how I would do it:

    # load the required packages; p_load() installs any that are missing first (pacman itself must already be installed)
    pacman::p_load(tidyverse, rvest, glue)
    
    hardwarezone_scraper <- function(page_number) {
        # our base URL
        hardwarezone_link<-"https://forums.hardwarezone.com.sg/threads/glgt-you-can-see-that-f-b-jobs-are-really-not-on-top-of-the-minds-of-singaporeans.6404486/page-{page_number}"
        
        # read the HTML and select the container of each post
        messages <- read_html(glue::glue(hardwarezone_link)) %>%
            html_nodes(".message-inner")
    
        # get the information we want
        usernames <- messages %>%
            html_nodes(".message-name") %>%
            html_text()
    
        user_status <- messages %>% 
            html_nodes(".message-userTitle") %>%
            html_text()
    
        post_date <- messages %>%
            html_nodes(".listInline") %>%
            html_nodes(".u-dt") %>%
            html_text() %>%
            # example is "Nov 4, 2020"
            parse_date(format = "%b %d, %Y")
    
        post <- messages %>%
            html_nodes(".bbWrapper") %>%
            html_text()
        
        # combine into a dataframe and return
        tibble(
            username = usernames,
            post = post,
            date = post_date,
            `user status` = user_status
        )
    }
    hardwarezone_scraper(1)
    

    The trick to scraping (if there is one) is to use the Chrome (or Firefox) inspector to examine the elements of the webpage and find the identifiers for the ones you want. Similar things often share the same class, as was the case here.
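To make the class-based selection concrete, here is a minimal offline sketch. The two HTML strings are made up for illustration (only the class names are taken from the code above), and `parse_messages()` is a hypothetical helper, not part of rvest. It assumes rvest 1.0.0 or later, where `html_elements()` and `html_element()` are the newer names for `html_nodes()` and `html_node()`. Calling `html_element()` on each post container returns exactly one result per post, so a post that lacks a field yields `NA` instead of shifting the columns out of alignment. The same function is then mapped over several pages and the results are stacked, which is how all 32 pages of the thread could be combined.

```r
library(rvest)
library(purrr)
library(tibble)
library(dplyr)

# Made-up HTML standing in for two forum pages; only the class names
# (.message-inner, .message-name, .message-userTitle, .bbWrapper) come from the answer.
page_one <- '
<div class="message-inner">
  <h4 class="message-name">Mackey</h4>
  <h5 class="message-userTitle">Senior member</h5>
  <div class="bbWrapper">why</div>
</div>
<div class="message-inner">
  <h4 class="message-name">eric cartman</h4>
  <div class="bbWrapper">kyle is bad</div>
</div>'

page_two <- '
<div class="message-inner">
  <h4 class="message-name">tegridy_farm</h4>
  <h5 class="message-userTitle">banned</h5>
  <div class="bbWrapper">why is that the case.</div>
</div>'

# Hypothetical helper: one row per post container.
parse_messages <- function(html) {
  messages <- minimal_html(html) %>%
    html_elements(".message-inner")

  # html_element() (singular) returns one match per container, or a
  # missing node whose text is NA when the class is absent in that post.
  tibble(
    username = messages %>% html_element(".message-name") %>% html_text(trim = TRUE),
    post = messages %>% html_element(".bbWrapper") %>% html_text(trim = TRUE),
    `user status` = messages %>% html_element(".message-userTitle") %>% html_text(trim = TRUE)
  )
}

# Parse every page, then stack the per-page tibbles into one.
all_posts <- map(list(page_one, page_two), parse_messages) %>%
  bind_rows()

print(all_posts)
```

With the real forum, the list of HTML strings would be replaced by the page numbers and `parse_messages()` by `hardwarezone_scraper()`; adding a short `Sys.sleep()` between requests is a common courtesy to the server.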