Efficiently turn a file directory into a list in R, possibly with Rcpp


I recently started using shinyTree for one of my applications, and I'm having trouble finding an efficient way to turn my directory into a list. My assumption is that the easiest way is to use something like Rcpp to take advantage of C++'s speed, but I'm not married to that idea. If that is the route to take, however, my skill set in that arena is virtually zero, so I'm hoping someone can provide a couple of code snippets to get me started in the right direction.

Here is the code I'm currently using to achieve what I'm trying to do:

create_directory_tree = function(root) {
  tree = list()
  file_lookup = data.frame(id=character(0), file_path=character(0), stringsAsFactors=FALSE)
  files = list.files(root, all.files=F, recursive=T, include.dirs=T)

  walk_directory = function(tree, path) {
    fp = file.path(root, path)
    is_dir = file.info(fp)$isdir
    if (is.null(is_dir) || is.na(is_dir)) {
      print(fp)
      return(NULL)
    }
    path = gsub("'|\"", "", path)
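    folders = stringr::str_split(path, "/")[[1]]  # str_split comes from the stringr package
    if (length(folders) == 0 || any(is.na(folders))) {  # was testing the base function `dir`, which is never NA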
      print(paste("Failed:", fp))
      return(NULL)
    }
    if (is_dir) {
      txt = paste("tree", paste("$'", folders, "'", sep="", collapse=""), " = numeric(0)", sep="")
    } else {
      txt = paste("tree", paste("$'", folders, "'", sep="", collapse=""), " = structure('', sticon='file')", sep="")
    }
    eval(parse(text = txt))
    return(tree)
  }

  for (i in seq_along(files)) {  # seq_along() also handles an empty directory
    tmp = data.frame(id=paste0("j1_", i), file_path=file.path(root, files[i]), stringsAsFactors=FALSE)
    file_lookup = rbind(file_lookup, tmp)
    tree = walk_directory(tree, files[i])
    save(tree, file_lookup, file="www/dir_tree.Rdata")
  }
}

This is taking an absurdly long time and I'm hoping there is something better. Thanks in advance.


Solution

  • The main issue is that you grow the data.frame with rbind on every iteration:

    file_lookup = rbind(file_lookup, tmp)
    

    Chances are the directory under root has a lot of content, so the slowdown comes from copying and recreating the data.frame on every iteration. You already know the number of files (i.e. length(files)), so preallocate the data.frame and fill row i inside the loop instead of calling rbind:

    files = list.files(root, all.files=F, recursive=T, include.dirs=T)
    nfiles = length(files)
    file_lookup = data.frame(id=character(nfiles), file_path=character(nfiles), stringsAsFactors=FALSE)
    

    Also, you save the objects to disk on every iteration of the for loop, which is an I/O bottleneck. I would move

    save(tree, file_lookup, file="www/dir_tree.Rdata")
    

    outside the loop.
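
A minimal sketch of both suggestions, assuming the ids only need to follow the same "j1_<i>" pattern as the loop in the question. The helper name build_file_lookup and the temporary paths are hypothetical. Because every column can be computed from files directly, the table is built in one vectorized call and written to disk once:

```r
# Hypothetical helper: build the id -> path lookup table in one vectorized step
# instead of growing it with rbind inside a loop.
build_file_lookup <- function(root) {
  files <- list.files(root, all.files = FALSE, recursive = TRUE, include.dirs = TRUE)
  data.frame(
    id = sprintf("j1_%d", seq_along(files)),  # same "j1_<i>" ids as the original loop
    file_path = file.path(root, files),
    stringsAsFactors = FALSE
  )
}

# Throwaway directory used only for this example
root <- file.path(tempdir(), "lookup_demo")
dir.create(file.path(root, "sub"), recursive = TRUE, showWarnings = FALSE)
invisible(file.create(file.path(root, "a.txt"), file.path(root, "sub", "b.txt")))

file_lookup <- build_file_lookup(root)

# Write to disk once, after everything has been built
out_file <- file.path(tempdir(), "dir_tree.Rdata")
save(file_lookup, file = out_file)
```

sprintf returns a zero-length vector when files is empty, so an empty directory gives a zero-row table.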

    Lastly, if you still want to try Rcpp, the Rcpp Gallery has several posts that work well as tutorials.
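
For the tree itself, plain recursion in base R may already be fast enough before reaching for Rcpp, though that would need to be measured on the real directory. The sketch below is an alternative to the eval(parse(...)) approach in the question, not something the answer prescribes. dir_to_list is a hypothetical name, and the sketch assumes shinyTree accepts the same node shapes the question's code produced: structure('', sticon='file') for files and numeric(0) for empty folders.

```r
# Hypothetical helper: convert one directory into a nested named list.
# Assumes R >= 3.2.0 for dir.exists() and no symlink cycles under `path`.
dir_to_list <- function(path) {
  # Non-recursive listing; it returns files and subdirectories alike
  entries <- list.files(path, all.files = FALSE, full.names = TRUE)
  if (length(entries) == 0) {
    return(numeric(0))  # empty folder: same placeholder the question's code used
  }
  is_dir <- dir.exists(entries)
  children <- lapply(seq_along(entries), function(i) {
    if (is_dir[i]) {
      dir_to_list(entries[i])          # recurse into subdirectories
    } else {
      structure("", sticon = "file")   # leaf node, marked as a file
    }
  })
  names(children) <- basename(entries)
  children
}

# Throwaway directory used only for this example
tree_root <- file.path(tempdir(), "tree_demo")
dir.create(file.path(tree_root, "sub"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(tree_root, "empty"), showWarnings = FALSE)
invisible(file.create(file.path(tree_root, "a.txt"), file.path(tree_root, "sub", "b.txt")))

tree <- dir_to_list(tree_root)
str(tree)
```

Each directory is listed exactly once, so the work grows with the number of entries rather than with repeated parsing of path strings.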