Tags: r, image, purrr, magick

How can I prevent my computer from crashing when running an R script on a large dataset?


  • Goal: Read, compress, and write images from one location to another. The image dataset is about 5 TB in size. The average size of an individual image is about 2-5 MB.
  • Problem: When I run it for the whole dataset, my Mac crashes after about 1 GB. The script works for a subset of about 400 images.

By reading in the images one by one, I had hoped it would not require too much memory or processing power, but I probably missed something. Could someone review my code below and provide insight into why it crashes? Any tips and suggestions would be very much appreciated. Apologies for not giving a reproducible example.

## 1. LOAD PACKAGES
library(magick)
library(purrr)
library(furrr)

## 2. SET MAIN FOLDER
Directory_Folder <- "C:/Users/Nick/Downloads/" 
Folder_Name <- "Photos for Nick"

## 3. SET NEW LOCATION
New_Directory <- "C:/Users/Daikoro/Desktop/"     ## MAKE SURE TO INCLUDE THE FINAL FORWARD SLASH

## 4. LIST ALL FILES
list.of.files <- list.files(path = paste0(Directory_Folder, Folder_Name), full.names = TRUE, recursive = TRUE)

## 5. FUNCTION FOR READING, RESIZING, AND WRITING IMAGES
MyFun <- function(i) {
  
  new.file.name <- gsub(Directory_Folder, New_Directory, i)
  
  magick::image_read(i) %>%  ## IMPORT PHOTOS INTO R
            image_scale("400") %>%  ## IMAGE RE-SCALING
            image_write(path = new.file.name)
}

## 6. SET UP MULTI-CORES
future::plan(multiprocess)

## 7. RUN FUNCTION ON ALL FILES
future_map(list.of.files, MyFun)   ## THIS WILL TAKE A WHILE...AND CRASHES AT 1GB

Solution

  • With feedback from Ben Bolker, r2evans, and Waldi, I managed to get the script running. I added gc() as the last line of MyFun and also specified the number of cores like this:

    ## SET UP MULTI-CORES
    no_cores <- availableCores() - 1
    future::plan(multisession, workers = no_cores)
    

    While this made the script much slower, at least it didn't crash. I'm not sure whether that's because more processing cores were available or because of the gc() line.
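
    For reference, the two changes described above (a gc() call at the end of the per-image function and an explicit worker count) could be combined roughly as in the sketch below. This is only a sketch under some assumptions, not the exact script that was run: the helper names new_path, compress_one, and run_all are hypothetical, fixed = TRUE and the dir.create() call are additions that were not in the original code, and each worker returns only the destination path.

```r
## Hypothetical helper: map source paths onto the destination tree.
## fixed = TRUE (an assumption, not in the original script) treats the
## folder name as a literal string rather than a regular expression.
new_path <- function(src, from_dir, to_dir) {
  sub(from_dir, to_dir, src, fixed = TRUE)
}

## Read, scale, and write one image, then free memory before returning.
compress_one <- function(src, dest) {
  ## Create the destination sub-folder if needed (assumption: it may not exist yet)
  dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)

  img <- magick::image_read(src)            ## import photo into R
  img <- magick::image_scale(img, "400")    ## re-scale to 400 px wide
  magick::image_write(img, path = dest)

  rm(img)
  gc()              ## the garbage-collection call added in the solution
  invisible(dest)   ## return only the path, not the image
}

## Run over all files with one core left free, as in the solution.
run_all <- function(files, from_dir, to_dir) {
  dests <- new_path(files, from_dir, to_dir)
  no_cores <- max(1, future::availableCores() - 1)
  future::plan(future::multisession, workers = no_cores)
  invisible(furrr::future_map2(files, dests, compress_one))
}
```

    With the variables from the question, this would be called as run_all(list.of.files, Directory_Folder, New_Directory). Computing the destination paths up front keeps compress_one free of other global objects, so each worker only needs the two path strings it is given.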