I already have this working, but I'm looking to optimize it. Extracting the article data takes a really long time because my methodology uses a for-loop: I go row by row, and each row takes a little more than a second to run. My actual dataset has about 10,000 rows, so the whole thing takes very long. Is there a way to extract the full article other than a for-loop? Since I'm applying the same methodology to every row, I'm wondering if there is a function in R similar to, say, multiplying a column by a number, which is super quick.
Creation of dummy dataset:
date <- as.Date(c('2020-06-25', '2020-06-25','2020-06-25','2020-06-25','2020-06-25','2020-06-25','2020-06-25','2020-06-25','2020-06-25','2020-06-25'))
text <- c('Isko cites importance of wearing face mask, gives 10K pieces to 30 barangays',
'GMRC now a law; to be integrated in school curriculum',
'QC to impose stringent measures to screen applicants for PWD ID',
'‘Baka kalaban ka:’ Cops intimidate dzBB reporter',
'Is gov’t playing with traditional jeepney drivers? A lawmaker thinks so',
'PNP records highest single-day COVID-19 tally as cases rise to 579',
'IBP tells new lawyers: ‘Excel without sacrificing honor’',
'Senators express concern over DepEd’s preparedness for upcoming school year',
'Angara calls for probe into reported spread of ‘fake’ PWD IDs',
'Grab PH eyes new scheme to protect food couriers vs no-show customers')
link <- c('https://newsinfo.inquirer.net/1297621/isko-cites-importance-of-wearing-face-mask-gives-10k-pieces-to-30-barangays',
'https://newsinfo.inquirer.net/1297618/gmrc-now-a-law-to-be-integrated-in-school-curriculum',
'https://newsinfo.inquirer.net/1297614/qc-to-impose-stringent-measures-to-screen-applicants-for-pwd-id',
'https://newsinfo.inquirer.net/1297606/baka-kalaban-ka-cops-intimidate-dzbb-reporter',
'https://newsinfo.inquirer.net/1297582/is-govt-playing-with-traditional-jeepney-drivers-a-party-list-lawmaker-thinks-so',
'https://newsinfo.inquirer.net/1297577/pnp-records-highest-single-day-covid-19-tally-as-cases-rose-to-579',
'https://newsinfo.inquirer.net/1297562/ibp-tells-new-lawyers-excel-without-sacrificing-honor',
'https://newsinfo.inquirer.net/1297559/senators-on-depeds-preparedness-for-upcoming-school-year',
'https://newsinfo.inquirer.net/1297566/angara-calls-for-probe-into-reported-spread-of-fake-pwd-ids',
'https://newsinfo.inquirer.net/1297553/grab-ph-eyes-new-scheme-to-protect-food-couriers-vs-no-show-customers')
df <- data.frame(date, text, link)
Dummy dataset:
df
date text link
1 2020-06-25 Isko cites importance of wearing face mask, gives 10K pieces to 30 barangays https://newsinfo.inquirer.net/1297621/isko-cites-importance-of-wearing-face-mask-gives-10k-pieces-to-30-barangays
2 2020-06-25 GMRC now a law; to be integrated in school curriculum https://newsinfo.inquirer.net/1297618/gmrc-now-a-law-to-be-integrated-in-school-curriculum
3 2020-06-25 QC to impose stringent measures to screen applicants for PWD ID https://newsinfo.inquirer.net/1297614/qc-to-impose-stringent-measures-to-screen-applicants-for-pwd-id
4 2020-06-25 ‘Baka kalaban ka:’ Cops intimidate dzBB reporter https://newsinfo.inquirer.net/1297606/baka-kalaban-ka-cops-intimidate-dzbb-reporter
5 2020-06-25 Is gov’t playing with traditional jeepney drivers? A lawmaker thinks so https://newsinfo.inquirer.net/1297582/is-govt-playing-with-traditional-jeepney-drivers-a-party-list-lawmaker-thinks-so
6 2020-06-25 PNP records highest single-day COVID-19 tally as cases rise to 579 https://newsinfo.inquirer.net/1297577/pnp-records-highest-single-day-covid-19-tally-as-cases-rose-to-579
7 2020-06-25 IBP tells new lawyers: ‘Excel without sacrificing honor’ https://newsinfo.inquirer.net/1297562/ibp-tells-new-lawyers-excel-without-sacrificing-honor
8 2020-06-25 Senators express concern over DepEd’s preparedness for upcoming school year https://newsinfo.inquirer.net/1297559/senators-on-depeds-preparedness-for-upcoming-school-year
9 2020-06-25 Angara calls for probe into reported spread of ‘fake’ PWD IDs https://newsinfo.inquirer.net/1297566/angara-calls-for-probe-into-reported-spread-of-fake-pwd-ids
10 2020-06-25 Grab PH eyes new scheme to protect food couriers vs no-show customers https://newsinfo.inquirer.net/1297553/grab-ph-eyes-new-scheme-to-protect-food-couriers-vs-no-show-customers
Code to get article data for every link:
library(rvest)    # read_html(), html_nodes(), html_text(); also provides %>%
library(tibble)

now <- Sys.time()
for (i in 1:nrow(df)) {
  # Fetch the page and collapse all paragraph text into a single string
  test_article <- read_html(df[i, 3]) %>%
    html_nodes(".article_align div p") %>%
    html_text() %>%
    toString()
  text_df <- tibble(test_article)
  df[i, 4] <- test_article
  print(paste(i, "/", nrow(df), sep = ""))
}
finish <- Sys.time()
finish - now
So just for 10 articles it took 10 seconds, which feels really long. Is there a faster way to do this?
You can parallelize the loop:
library(doParallel)  # also loads foreach and parallel

# Set up a parallel backend using most of the available processors
cores <- detectCores()
cl <- makeCluster(cores - 1)  # leave one core free so as not to overload your computer
registerDoParallel(cl)

now <- Sys.time()
result <- foreach(i = 1:nrow(df), .combine = rbind, .packages = c('dplyr', 'rvest')) %dopar% {
  test_article <- read_html(df[i, 3]) %>%
    html_nodes(".article_align div p") %>%
    html_text() %>%
    toString()
  data.frame(test_article = test_article, ID = paste(i, "-", nrow(df), sep = ""))
}
finish <- Sys.time()
finish - now

# Stop the cluster when done
stopCluster(cl)
Note that you can't write into the original dataframe from inside the foreach loop because each task runs in a separate environment.
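Instead, collect the results and attach them to `df` afterwards. This is a minimal sketch, assuming `result` is the data frame returned by the `foreach` call above (the column names `test_article` and `ID` come from that code):

```r
# Recover each original row index from the ID column (format "i-nrow"),
# then put the scraped text back in original row order as a new column.
idx <- as.integer(sub("-.*", "", result$ID))
df$article <- result$test_article[order(idx)]
```

With `.combine = rbind` and `foreach`'s default `.inorder = TRUE`, results already come back in input order, so the reordering is just a safeguard.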