Tags: rpostgresql, for-loop, lapply, r-dbi

R code optimization: For loop and writing to a database


I am trying to optimize a simple R code I wrote on two aspects:

1) For loops

2) Writing data into my PostgreSQL database

For 1), I know for loops are supposed to be avoided at all costs and that lapply is recommended instead, but I am not clear on how to translate my code below into lapply.

For 2), what I do below works, but I am not sure it is the most efficient way (for example, writing at each iteration this way versus rbinding all data into an R data frame and then loading the whole data frame into my PostgreSQL database).

EDIT: I updated my code with a reproducible example below.

library(rvest)   # read_html(), html_nodes(), html_text()
library(DBI)     # dbWriteTable(); con is an existing PostgreSQL connection

for (i in 1:100){
   search <- paste0("https://github.com/search?o=desc&p=", i, "&q=R&type=Repositories")
   download.file(search, destfile = 'scrape.html', quiet = TRUE)
   url <- read_html('scrape.html')
   github_title <- url %>% html_nodes(xpath = "//div[@class='mt-n1']") %>% html_text()
   github_link  <- url %>% html_nodes(xpath = "//div[@class='mt-n1']//@href") %>% html_text()
   df <- data.frame(github_title, github_link)
   colnames(df) <- c("title", "link")
   dbWriteTable(con, "my_database", df, append = TRUE, row.names = FALSE)
   cat(i)
}

Thanks a lot for all your inputs!


Solution

  • First of all, it is a myth that should be thoroughly dispelled that lapply is in any way faster than equivalent code using a for loop. This was fixed in R years ago, and a properly written for loop is generally at least as fast as the equivalent lapply.
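
    If you want to check this yourself, an informal timing comparison is easy to run; the toy workload below is purely illustrative and not taken from the question:

    ## Toy benchmark: fill a preallocated list in a for loop vs. lapply.
    n <- 1e4
    system.time({
      out_for <- vector('list', n)
      for (i in seq_len(n)) out_for[[i]] <- data.frame(x = i)
    })
    system.time(
      out_lapply <- lapply(seq_len(n), function(i) data.frame(x = i))
    )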

    I will illustrate this with a for loop, as you seem to find that more intuitive. Do note, however, that I work mostly in T-SQL, so some conversion may be necessary.

    n <- 1e5   # number of elements to insert
    outputDat <- vector('list', n)
    for (i in 1:n){
      ## element_a ... element_d are placeholders for your scraped vectors
      id <- element_a[i]
      location <- element_b[i]
      language <- element_c[i]
      date_creation <- element_d[i]
      df <- data.frame(id, location, language, date_creation)
      colnames(df) <- c("id", "location", "language", "date_creation")
      outputDat[[i]] <- df
    }
    ## Combine the data.frames
    outputDat <- do.call('rbind', outputDat)
    ## Write the combined data.frame into the database.
    ##dbBegin(con)   #<= might speed up, might not.
    dbWriteTable(con, "my_database", outputDat, append = TRUE, row.names = FALSE)
    ##dbCommit(con)  #<= might speed up, might not.
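
    Since the question explicitly asks about lapply: the same collect-then-write pattern can be written with lapply as sketched below (element_a ... element_d are the same placeholder vectors as above). Either way, the important part is that dbWriteTable is called once on the combined data.frame rather than once per iteration.

    ## lapply version of the same pattern: one small data.frame per element,
    ## combined once, written once.
    outputDat <- lapply(1:n, function(i) {
      data.frame(id            = element_a[i],
                 location      = element_b[i],
                 language      = element_c[i],
                 date_creation = element_d[i])
    })
    outputDat <- do.call('rbind', outputDat)
    dbWriteTable(con, "my_database", outputDat, append = TRUE, row.names = FALSE)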
    

    Using Transact-SQL you could alternatively combine the entire table into a single INSERT INTO statement. Here I'll deviate and use apply to iterate over the rows, as it is much more readable in this case. A for loop is, once again, just as fast if done properly.

    #Create the statements here.
    statement <- paste0("('", apply(outputDat, 1, paste0, collapse = "','"), "')", collapse = ",\n") #\n can be removed, but makes printing nicer.
    ##Optional: Print a bit of the statement
    # cat(substr(statement, 1, 2000))
    
    ##dbBegin(con)   #<= might speed up, might not.
    statement <- paste0(
    "
    /*
      SET NOCOUNT ON seems to be necessary with the DBI API.
      It seems to react to 'n rows affected' messages.
      Note: this only affects this method, not the one using dbWriteTable.
    */
    --SET NOCOUNT ON
    INSERT INTO [my table] VALUES ", statement)
    dbExecute(con, statement)
    ##dbCommit(con)   #<= might speed up, might not.
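
    Since the question targets PostgreSQL rather than SQL Server, a rough sketch of the same single-statement approach without the T-SQL specifics might look like this; "my_database" and the column names are the placeholders used above, and the values string is rebuilt the same way as before:

    ## PostgreSQL flavour (sketch): double-quoted identifiers, no SET NOCOUNT or [brackets].
    ## Caveat: values built with paste0 are not escaped; use DBI::dbQuoteString() for real data.
    values <- paste0("('", apply(outputDat, 1, paste0, collapse = "','"), "')", collapse = ",\n")
    pg_statement <- paste0(
      'INSERT INTO "my_database" (id, location, language, date_creation) VALUES ', values)
    dbExecute(con, pg_statement)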
    

    Note that, as the comment above says, this method might simply fail to upload the table properly, as the DBI package sometimes seems to fail this kind of transaction if it results in one or more 'n rows affected' messages.

    Last but not least, once the statements are made, they can be copied from R and pasted into any GUI that accesses the database directly, for example using writeLines(statement, 'clipboard') or by writing them to a text file (a file is more stable if your data contains a lot of rows). In rare outlier cases this last resort can be faster, if DBI or alternative R packages run overly slowly for no apparent reason. As this seems to be somewhat of a personal project, that might be sufficient for your use.
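
    For instance, dumping the generated statement to a .sql file that can then be run from psql, pgAdmin, or another client is a one-liner (the file name here is arbitrary):

    ## Write the full INSERT statement to a file for use outside R.
    writeLines(statement, "insert_statement.sql")
    ## On Windows, writeLines(statement, 'clipboard') copies it to the clipboard instead.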