Tags: r, web-scraping, dplyr, rvest, lubridate

"Date" column imported incorrectly after webscraping (Rvest)


I am trying to scrape multiple links/sources from an online social forum, but the posts come from different dates: one forum topic might open in Dec 2020 while another opens in July 2021, and it is crucial for me to organize the posts chronologically.

# Load the required libraries
library(tidyverse)
library(rvest)
library(writexl)
library(purrr)
library(pacman)
library(httr)
library(lubridate)
library(readr) 
library(zoo)
#install.packages("pacman")
#install.packages("tidyverse")

Initialize the vectors to store the scraped data

username <- vector() 
post <- vector() 
date <- vector()
user_status <- vector()

The scraping code below runs without errors, but for some reason the "date" variable only contains dates from 2021-09-21 onwards. This is incorrect: the topic under url_2 below starts in November 2020, so I would expect the dataset to begin with the posts written in Nov 2020 rather than Sept 2021.

#organize the data as follows: username, post, date, and user status.

# Loop through the pages of the forum thread
for (i in 1:100) {
# Construct the url for sources
  url_1 <- paste0("https://forums.hardwarezone.com.sg/threads/companies-may-exit-singapore-if-they-do-not-have-access-to-the-complementary-foreign-manpower-they-need-tan-see-leng.6817819/page-", i)  
  
    url_2 <- paste0("https://forums.hardwarezone.com.sg/threads/glgt-you-can-see-that-f-b-jobs-are-really-not-on-top-of-the-minds-of-singaporeans.6404486/page-", i) 
  
   url_3 <- paste0("https://forums.hardwarezone.com.sg/threads/disappointing-hard-truth-the-singaporean-worker-is-more-expensive-than-ft-coz-of-cpf-even-if-paid-same-wages-from-mom-data.6493727/page-", i)   

# Get the HTML content of all sources
  page1 <- GET(url_1)
  page2 <- GET(url_2)
  page3 <- GET(url_3)


  # Parse the HTML content
  soup <- read_html(page1)
  soup <- read_html(page2)
  soup <- read_html(page3)

# Extract the section containing the messages
  section <- html_nodes(soup, "article.message")
  
  # Loop through each message in the section
  for (j in section) {
    # Append the username of the message author to the username vector
    username <- c(username, html_text(html_node(j, "a.username"))) 
    
    # Append the post content of the message to the post vector
    post <- c(post, html_text(html_node(j, "div.bbWrapper")))
    
    # Extract the date string of the message
    date_str <- html_text(html_node(j, "time.u-dt"))
    
    # Check if the date string is not empty
    if (date_str != "") { 
      # Convert the date string to a date object and append it to the date vector
      date <- c(date, as.Date(date_str, format = "%b %d, %Y"))
    } else {
      # If the date string is empty, append NA to the date vector
      date <- c(date, NA) 
    }
    
    # Append the user status of the message author to the user_status vector
    user_status <- c(user_status, html_text(html_node(j, "h5.userTitle.message-userTitle")))
  }
}
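
A side note on the date parsing in the loop: `as.Date(date_str, format = "%b %d, %Y")` depends on the system locale for month abbreviations, and XenForo-based forums such as HardwareZone typically show relative labels ("Yesterday", "5 minutes ago") in the visible text of recent posts while keeping a machine-readable `datetime` attribute on the `time.u-dt` element. Reading that attribute is more robust than parsing the displayed text. A minimal sketch, assuming the usual XenForo markup (check the live HTML to confirm the attribute name):

    library(rvest)

    # Read the ISO 8601 "datetime" attribute of a post's <time> element
    # instead of parsing the locale-dependent visible label.
    extract_post_date <- function(message_node) {
      iso <- html_attr(html_node(message_node, "time.u-dt"), "datetime")
      if (is.na(iso)) return(as.Date(NA))
      as.Date(substr(iso, 1, 10))  # keep only the YYYY-MM-DD prefix
    }

This also avoids the later round-trip through numeric dates and `origin = "1970-01-01"`, since the result is already a `Date` object.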

Create a data frame from the vectors

hardwarezone_posts <- data.frame(username, post, date, user_status)
# Format the date column as a date object
hardwarezone_posts$date <- format(as.Date(hardwarezone_posts$date, origin = "1970-01-01"), "%d/%m/%Y")

#print data example

dput(hardwarezone_posts[1:6,c(1,2,3)])

output:

structure(list(username = c("jonesftw", "matrix05", "whitecabbage", 
"walceab", "jonesftw", "Ianyhowtalk"), post = c("\n\t\n\t\n\t\t\n\t\t\n\t\t\tOn 14 September 2021, the Ministry of Manpower (MOM) mounted a 12-hour long enforcement operation at 22 locations island-wide as part of an investigation involving a syndicate suspected of bringing foreigners into Singapore on work passes obtained through false declarations. A total of 18 persons were arrested. The investigation is ongoing.\nModus Operandi\n2MOM began its investigations in July 2021 upon obtaining information of a foreigner’s attempts to acquire a work pass illegally. Through detailed analyses over a few months, MOM uncovered a potential syndicate suspected of setting up several shell companies to apply for work passes, even though they had no legitimate business operations.\n3Such syndicates typically recruit Singapore citizens and Singapore permanent residents to receive CPF contributions as “phantom local workers” in order to illegally inflate the companies’ quota to hire foreigners. Based on the inflated quota, the companies would apply for work passes for the foreigners through false declarations and collect kickbacks from them. These foreigners would then enter and remain in Singapore via these illegally obtained work passes. These practices undermine the integrity of our work pass framework.\nPenalties\n4Under the Employment of Foreign Manpower Act (EFMA), individuals convicted of obtaining work passes for a business that does not exist, is not in operation, or does not require the employment of foreigners may be liable to a fine not exceeding $6,000, imprisonment for up to two years, or both, per charge. If convicted for six or more charges, caning will also be imposed.\n5 Employers who hire foreigners seeking illegal employment may be liable to a fine not exceeding $30,000, imprisonment for up to 12 months, or both, per charge. 
Upon conviction, they will be barred from employing foreigners.\n6Foreigners who undertake employment without a valid work pass may be liable to a fine not exceeding $20,000, imprisonment for up to two years, or both. Upon conviction, they will be permanently barred from working in Singapore.\n7Members of the public who are aware of suspicious employment activities such as companies employing foreigners without valid work passes, persons receiving CPF contributions from unknown companies, or know of persons or employers who contravene the EFMA should report the matter to MOM at 64385122 or [email protected]. All information will be kept strictly confidential.\n\t\t\n\t\tClick to expand...\n\t\n\n\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t18 Arrested for Suspected Illegal Labour Importation\n\t\t\t\t\t\n\t\t\t\t\n\n\t\t\t\tOn 14 September 2021, the Ministry of Manpower (MOM) mounted a 12-hour long enforcement operation at 22 locations island-wide as part of an investigation involving a syndicate suspected of bringing foreigners into Singapore on work passes obtained...\n\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\twww.mom.gov.sg\n\t\t\t\t\n\t\t\t\n\t\t\n\t", 
"Dr Tan is doing enforcement. Jo closes 1( or 3) eyes", "All these 18 people should face 24 strokes of the cane EVERYDAY for the rest of their lives for being traitors", 
"you know what? this practice had been ongoing since many donkey years ago so why they only acting now? cause of PSP debate that brings fire to PAP doorsteps then they must act act a bit? I think whole MOM should be sacked, acting blur for so many donkey years showing that they are lacking the skills and trust of citizens. Why am i paying tax to pay their high salaries?", 
"\n\t\t\t{\n\t\t\t\t\"lightbox_close\": \"Close\",\n\t\t\t\t\"lightbox_next\": \"Next\",\n\t\t\t\t\"lightbox_previous\": \"Previous\",\n\t\t\t\t\"lightbox_error\": \"The requested content cannot be loaded. Please try again later.\",\n\t\t\t\t\"lightbox_start_slideshow\": \"Start slideshow\",\n\t\t\t\t\"lightbox_stop_slideshow\": \"Stop slideshow\",\n\t\t\t\t\"lightbox_full_screen\": \"Full screen\",\n\t\t\t\t\"lightbox_thumbnails\": \"Thumbnails\",\n\t\t\t\t\"lightbox_download\": \"Download\",\n\t\t\t\t\"lightbox_share\": \"Share\",\n\t\t\t\t\"lightbox_zoom\": \"Zoom\",\n\t\t\t\t\"lightbox_new_window\": \"New window\",\n\t\t\t\t\"lightbox_toggle_sidebar\": \"Toggle sidebar\"\n\t\t\t}\n\t\t\t\n\t\t\n\n\n\n\t\t\n\n\n\n\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t18 people in Singapore arrested for illegal labour importation\n\t\t\t\t\t\n\t\t\t\t\n\n\t\t\t\tInvestigation involves syndicate suspected of bringing foreigners here on work passes obtained through false declarations. Read more at straitstimes.com.\n\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\twww.straitstimes.com\n\t\t\t\t\n\t\t\t\n\t\t\n\t", 
"Last time don’t raid, last year don’t raid. Why raid now.\ngo google and search \nwork permit singapore \neasy PR singapore\nPR singapore\na lot can be find, a lot can be catch. Why only now. ???"
), date = structure(c(18891, 18891, 18891, 18891, 18891, 18891
), class = "Date")), row.names = c(NA, 6L), class = "data.frame")

Solution

  • If this is the actual code, the for-loop generates 300 requests: 100 for each forum thread (url_1, url_2 and url_3), requesting pages 1 ... 100 of each. However, parsing is only applied to the pages of url_3, because the object storing the parsed content, soup, is overwritten (twice) at the start of each cycle. As a result, all of your collected posts come from a single thread.

    As you generate URLs for 100 pages, have you checked what happens when you request a page past the last one?
    Taking url_3 as an example: it currently has 10 pages (20 posts per page, plus a remainder on the 10th). When a request goes beyond the last page, e.g. the 100th in the final cycle of that for-loop ( forums.hardwarezone.com.sg/threads/.../page-100 ), the server returns the actual last page, .../page-10, instead. So your resulting dataset is not only from a single thread: the posts on the 10th page are also repeated 91 times (the responses for loop cycles 10 ... 100 are identical).


    The answer I left for your previous question on the same topic extracts pagination details from the page content to avoid exactly this issue. It was also tested and proved fully reproducible, though only with up-to-date packages; so if you had issues with map(), you might want to consider updating your packages.
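
Putting the two fixes together (parse each thread separately, and read the real page count from the thread's own pagination instead of hard-coding 100), the loop could be restructured along these lines. This is a sketch, not a drop-in replacement: the `.pageNav-page` selector and the `datetime` attribute are assumptions based on typical XenForo markup, so verify them against the live pages.

    library(rvest)
    library(purrr)

    # Turn one parsed page into a data frame: one row per post.
    parse_messages <- function(page) {
      msgs <- html_elements(page, "article.message")
      data.frame(
        username    = html_text2(html_element(msgs, "a.username")),
        post        = html_text2(html_element(msgs, "div.bbWrapper")),
        date        = as.Date(substr(html_attr(html_element(msgs, "time.u-dt"), "datetime"), 1, 10)),
        user_status = html_text2(html_element(msgs, "h5.userTitle"))
      )
    }

    # Scrape every page of one thread, using the thread's pagination
    # nav to find the last page instead of assuming 100 pages exist.
    scrape_thread <- function(base_url) {
      first_page <- read_html(base_url)
      page_nums  <- suppressWarnings(
        as.integer(html_text2(html_elements(first_page, ".pageNav-page")))
      )
      page_nums  <- page_nums[!is.na(page_nums)]
      last_page  <- if (length(page_nums)) max(page_nums) else 1
      map_dfr(seq_len(last_page), function(i) {
        parse_messages(read_html(paste0(base_url, "page-", i)))
      })
    }

    # urls <- c(url_1_base, url_2_base, url_3_base)  # thread URLs without the "page-i" suffix
    # hardwarezone_posts <- map_dfr(urls, scrape_thread)
    # hardwarezone_posts <- hardwarezone_posts[order(hardwarezone_posts$date), ]  # chronological

Because each thread gets its own parsed page inside `scrape_thread`, nothing is overwritten between threads, and rows can simply be sorted by the `date` column afterwards.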