Tags: rpostgresql, for-loop, lapply, r-dbi

R code optimization: For loop and writing to a database


I am trying to optimize a simple R code I wrote on two aspects:

1) For loops

2) Writing data into my PostgreSQL database

For 1), I know for loops are supposed to be avoided at all costs and that lapply is recommended instead, but I am not clear on how to translate my code below into lapply.

For 2), what I do below works, but I am not sure it is the most efficient way (for example, writing at each iteration this way versus rbinding all data into an R data frame and then loading the whole data frame into my PostgreSQL database).

EDIT: I updated my code with a reproducible example below.

library(rvest)   # read_html(), html_nodes(), html_text()
library(DBI)     # dbWriteTable(); con is an existing PostgreSQL connection

for (i in 1:100){
   search <- paste0("https://github.com/search?o=desc&p=", i, "&q=R&type=Repositories")
   download.file(search, destfile = 'scrape.html', quiet = TRUE)
   url <- read_html('scrape.html')
   github_title <- url %>% html_nodes(xpath = "//div[@class='mt-n1']") %>% html_text()
   github_link  <- url %>% html_nodes(xpath = "//div[@class='mt-n1']//@href") %>% html_text()
   df <- data.frame(github_title, github_link)
   colnames(df) <- c("title", "link")
   dbWriteTable(con, "my_database", df, append = TRUE, row.names = FALSE)
   cat(i)
}

Thanks a lot for all your inputs!


Solution

  • First of all, it is a myth that should be thoroughly dispelled that lapply is in any way faster than equivalent code using a for loop. This was fixed in R years ago, and a properly written for loop is generally at least as fast as the equivalent lapply.
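
    If you want to check this yourself, an informal timing comparison is easy to run; the toy workload below is purely illustrative and not taken from the question:

    ## Toy benchmark: fill a preallocated list in a for loop vs. lapply.
    n <- 1e4
    system.time({
      out_for <- vector('list', n)
      for (i in seq_len(n)) out_for[[i]] <- data.frame(x = i)
    })
    system.time(
      out_lapply <- lapply(seq_len(n), function(i) data.frame(x = i))
    )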

    I will illustrate this with a for loop, as you seem to find that more intuitive. Do note, however, that I work mostly in T-SQL, so some conversion may be necessary.

    n <- 1e5   # number of elements to insert
    outputDat <- vector('list', n)
    for (i in 1:n){
      ## element_a ... element_d are placeholders for your scraped vectors
      id <- element_a[i]
      location <- element_b[i]
      language <- element_c[i]
      date_creation <- element_d[i]
      df <- data.frame(id, location, language, date_creation)
      colnames(df) <- c("id", "location", "language", "date_creation")
      outputDat[[i]] <- df
    }
    ## Combine the data.frames
    outputDat <- do.call('rbind', outputDat)
    ## Write the combined data.frame into the database.
    ##dbBegin(con)   #<= might speed up, might not.
    dbWriteTable(con, "my_database", outputDat, append = TRUE, row.names = FALSE)
    ##dbCommit(con)  #<= might speed up, might not.
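
    Since the question explicitly asks about lapply: the same collect-then-write pattern can be written with lapply as sketched below (element_a ... element_d are the same placeholder vectors as above). Either way, the important part is that dbWriteTable is called once on the combined data.frame rather than once per iteration.

    ## lapply version of the same pattern: one small data.frame per element,
    ## combined once, written once.
    outputDat <- lapply(1:n, function(i) {
      data.frame(id            = element_a[i],
                 location      = element_b[i],
                 language      = element_c[i],
                 date_creation = element_d[i])
    })
    outputDat <- do.call('rbind', outputDat)
    dbWriteTable(con, "my_database", outputDat, append = TRUE, row.names = FALSE)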
    

    Using Transact-SQL you could alternatively combine the entire table into a single INSERT INTO statement. Here I'll deviate and use apply to iterate over the rows, as it is much more readable in this case. A for loop is, once again, just as fast if done properly.

    #Create the statements here.
    statement <- paste0("('", apply(outputDat, 1, paste0, collapse = "','"), "')", collapse = ",\n") #\n can be removed, but makes printing nicer.
    ##Optional: Print a bit of the statement
    # cat(substr(statement, 1, 2000))
    
    ##dbBegin(con)   #<= might speed up, might not.
    statement <- paste0(
    "
    /*
      SET NOCOUNT ON seems to be necessary with the DBI API.
      It seems to react to 'n rows affected' messages.
      Note: this only affects this method, not the one using dbWriteTable.
    */
    --SET NOCOUNT ON
    INSERT INTO [my table] VALUES ", statement)
    dbExecute(con, statement)
    ##dbCommit(con)   #<= might speed up, might not.
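
    Since the question targets PostgreSQL rather than SQL Server, a rough sketch of the same single-statement approach without the T-SQL specifics might look like this; "my_database" and the column names are the placeholders used above, and the values string is rebuilt the same way as before:

    ## PostgreSQL flavour (sketch): double-quoted identifiers, no SET NOCOUNT or [brackets].
    ## Caveat: values built with paste0 are not escaped; use DBI::dbQuoteString() for real data.
    values <- paste0("('", apply(outputDat, 1, paste0, collapse = "','"), "')", collapse = ",\n")
    pg_statement <- paste0(
      'INSERT INTO "my_database" (id, location, language, date_creation) VALUES ', values)
    dbExecute(con, pg_statement)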
    

    Note that, as the comment above says, this method might simply fail to upload the table properly, as the DBI package sometimes seems to fail this kind of transaction if it results in one or more 'n rows affected' messages.

    Last but not least, once the statements are made, they can be copied from R and pasted into any GUI that accesses the database directly, for example using writeLines(statement, 'clipboard') or by writing them to a text file (a file is more stable if your data contains a lot of rows). In rare outlier cases this last resort can be faster, if DBI or alternative R packages run overly slowly for no apparent reason. As this seems to be somewhat of a personal project, that might be sufficient for your use.
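
    For instance, dumping the generated statement to a .sql file that can then be run from psql, pgAdmin, or another client is a one-liner (the file name here is arbitrary):

    ## Write the full INSERT statement to a file for use outside R.
    writeLines(statement, "insert_statement.sql")
    ## On Windows, writeLines(statement, 'clipboard') copies it to the clipboard instead.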