Tags: r, for-loop, apply, lapply, sapply

Alternative to for-loop needed to optimize speed of working script


I already have this working, but I'm looking to optimize it. Extracting the article data takes a really long time because my methodology uses a for-loop: I go row by row, and each row takes a little more than a second to run. My actual dataset, however, has about 10,000 rows, so the total run time is very long. Is there a way to extract the full article other than a for-loop? I am doing the same thing for every row, so I'm wondering whether R has a function for this that works like a vectorized operation, similar to multiplying a whole column by a number, which is super quick.
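
For reference, the apply family can express the same per-row work without an explicit loop. Below is only a minimal sketch: scrape_article is an illustrative helper name, the CSS selector is the one used in the loop further down, and the two links are taken from the dummy dataset below. Note that sapply() still runs sequentially, so for scraping that is dominated by the network request it is unlikely to be much faster than the for-loop itself.

library(rvest)  # read_html(), html_nodes(), html_text(); also re-exports %>%

# illustrative helper: scrape one article body and collapse it into a single string
scrape_article <- function(url) {
  read_html(url) %>%
    html_nodes(".article_align div p") %>%
    html_text() %>%
    toString()
}

# two of the links from the dummy dataset below; in practice this would be sapply(df$link, scrape_article)
links <- c("https://newsinfo.inquirer.net/1297621/isko-cites-importance-of-wearing-face-mask-gives-10k-pieces-to-30-barangays",
           "https://newsinfo.inquirer.net/1297618/gmrc-now-a-law-to-be-integrated-in-school-curriculum")
articles <- sapply(links, scrape_article)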

Creation of dummy dataset:

date<- as.Date(c('2020-06-25', '2020-06-25','2020-06-25','2020-06-25','2020-06-25','2020-06-25','2020-06-25','2020-06-25','2020-06-25','2020-06-25'))

text <- c('Isko cites importance of wearing face mask, gives 10K pieces to 30 barangays', 
      'GMRC now a law; to be integrated in school curriculum',
      'QC to impose stringent measures to screen applicants for PWD ID',
      '‘Baka kalaban ka:’ Cops intimidate dzBB reporter',
      'Is gov’t playing with traditional jeepney drivers? A lawmaker thinks so',
      'PNP records highest single-day COVID-19 tally as cases rise to 579',
      'IBP tells new lawyers: ‘Excel without sacrificing honor’',
      'Senators express concern over DepEd’s preparedness for upcoming school year',
      'Angara calls for probe into reported spread of ‘fake’ PWD IDs',
      'Grab PH eyes new scheme to protect food couriers vs no-show customers')
link<- c('https://newsinfo.inquirer.net/1297621/isko-cites-importance-of-wearing-face-mask-gives-10k-pieces-to-30-barangays',  
     'https://newsinfo.inquirer.net/1297618/gmrc-now-a-law-to-be-integrated-in-school-curriculum',                           
     'https://newsinfo.inquirer.net/1297614/qc-to-impose-stringent-measures-to-screen-applicants-for-pwd-id',                 
     'https://newsinfo.inquirer.net/1297606/baka-kalaban-ka-cops-intimidate-dzbb-reporter',                                  
     'https://newsinfo.inquirer.net/1297582/is-govt-playing-with-traditional-jeepney-drivers-a-party-list-lawmaker-thinks-so',
     'https://newsinfo.inquirer.net/1297577/pnp-records-highest-single-day-covid-19-tally-as-cases-rose-to-579',             
     'https://newsinfo.inquirer.net/1297562/ibp-tells-new-lawyers-excel-without-sacrificing-honor',                         
     'https://newsinfo.inquirer.net/1297559/senators-on-depeds-preparedness-for-upcoming-school-year',                      
     'https://newsinfo.inquirer.net/1297566/angara-calls-for-probe-into-reported-spread-of-fake-pwd-ids',                   
     'https://newsinfo.inquirer.net/1297553/grab-ph-eyes-new-scheme-to-protect-food-couriers-vs-no-show-customers')

df<-data.frame(date, text, link)

Dummy dataset:

df
         date                                                                         text                                                 link
1  2020-06-25 Isko cites importance of wearing face mask, gives 10K pieces to 30 barangays   https://newsinfo.inquirer.net/1297621/isko-cites-importance-of-wearing-face-mask-gives-10k-pieces-to-30-barangays
2  2020-06-25                        GMRC now a law; to be integrated in school curriculum   https://newsinfo.inquirer.net/1297618/gmrc-now-a-law-to-be-integrated-in-school-curriculum
3  2020-06-25              QC to impose stringent measures to screen applicants for PWD ID   https://newsinfo.inquirer.net/1297614/qc-to-impose-stringent-measures-to-screen-applicants-for-pwd-id
4  2020-06-25                             ‘Baka kalaban ka:’ Cops intimidate dzBB reporter   https://newsinfo.inquirer.net/1297606/baka-kalaban-ka-cops-intimidate-dzbb-reporter
5  2020-06-25      Is gov’t playing with traditional jeepney drivers? A lawmaker thinks so   https://newsinfo.inquirer.net/1297582/is-govt-playing-with-traditional-jeepney-drivers-a-party-list-lawmaker-thinks-so
6  2020-06-25           PNP records highest single-day COVID-19 tally as cases rise to 579   https://newsinfo.inquirer.net/1297577/pnp-records-highest-single-day-covid-19-tally-as-cases-rose-to-579
7  2020-06-25                     IBP tells new lawyers: ‘Excel without sacrificing honor’   https://newsinfo.inquirer.net/1297562/ibp-tells-new-lawyers-excel-without-sacrificing-honor
8  2020-06-25  Senators express concern over DepEd’s preparedness for upcoming school year   https://newsinfo.inquirer.net/1297559/senators-on-depeds-preparedness-for-upcoming-school-year
9  2020-06-25                Angara calls for probe into reported spread of ‘fake’ PWD IDs   https://newsinfo.inquirer.net/1297566/angara-calls-for-probe-into-reported-spread-of-fake-pwd-ids
10 2020-06-25        Grab PH eyes new scheme to protect food couriers vs no-show customers   https://newsinfo.inquirer.net/1297553/grab-ph-eyes-new-scheme-to-protect-food-couriers-vs-no-show-customers

Code to get article data for every link:

library(rvest)   # read_html(), html_nodes(), html_text(); also re-exports %>%
library(tibble)  # tibble()

now <- Sys.time()
for (i in 1:nrow(df)) {
  # scrape the article body from the link in column 3 and collapse it into one string
  test_article <- read_html(df[i, 3]) %>%
    html_nodes(".article_align div p") %>%
    html_text() %>%
    toString()

  text_df <- tibble(test_article)
  # store the article text in a new fourth column of df
  df[i, 4] <- test_article
  # simple progress indicator, e.g. "3/10"
  print(paste(i, "/", nrow(df), sep = ""))
}
finish <- Sys.time()
finish - now

So just for 10 articles it took 10 seconds, which feels really long. I'm looking to see if there is a faster way to do this.


Solution

  • You can parallelize the loop:

    library(foreach)     # foreach() and the %dopar% operator
    library(doParallel)  # registerDoParallel(); also loads parallel for makeCluster()/detectCores()
    library(rvest)
    library(dplyr)

    # set up a parallel backend using most of the available processors
    cores <- detectCores()
    cl <- makeCluster(cores[1] - 1) # leave one core free so you don't overload your computer
    registerDoParallel(cl)

    now <- Sys.time()
    result <- foreach(i = 1:nrow(df), .combine = rbind, .packages = c('dplyr', 'rvest')) %dopar% {
      test_article <- read_html(df[i, 3]) %>%
        html_nodes(".article_align div p") %>%
        html_text() %>%
        toString()

      # each task returns a one-row data frame; .combine = rbind stacks them
      data.frame(test_article = test_article, ID = paste(i, "-", nrow(df), sep = ""))
    }
    finish <- Sys.time()
    finish - now

    # stop the cluster once the scraping is done
    stopCluster(cl)
    

    Note that you can't write into the original data frame from inside the foreach loop, because each task runs in a separate worker process that only sees a copy of df; collect the scraped text from result instead.
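
    As a sketch of how the scraped text could then be attached back to df (this assumes the result object from the foreach call above, and the new column name article is arbitrary; foreach keeps results in iteration order by default, .inorder = TRUE, so row i of result corresponds to row i of df):

    # add the scraped article text as a new column on the original data frame
    df$article <- result$test_article

    # optional sanity check using the ID column built inside the loop (e.g. "3-10" -> 3)
    all(as.integer(sub("-.*", "", result$ID)) == seq_len(nrow(df)))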