
Function to extract date from jpg files in a directory


I have a large number (approximately 10,000) of jpg files, each with a date written on it. I wish to extract the date from each jpg and add it to a dataframe alongside the corresponding filename.

I have read this forum and beyond and I have tried to patch together a function in R which will perform the task but I cannot get it to work. I have used a loop to:

1) generate a list of image files in the chosen directory

2) create a dataframe for the results with a column for file path and a column for date (extracted from the jpg)

3) loop through files in directory: Resize, Crop to portion of image showing date, OCR the image, Write date to dataframe - created in step 2

The function seems to fail when I run it and I am not really sure why. I am an R user but I have not written functions before (you can probably tell).

I am using R 3.6.0 and RStudio

library(tesseract)
library(magick)
library(tidyverse)
library(gsubfn)

get_jpeg_date <- function(folder) {
  file_list <- list.files(path=folder, pattern="*.jpg", recursive = T)
  image_dates <- as.data.frame(file_list)
  image_dates $ ImageDate <- rep_len(x = NA, length.out = length(file_list))
  eng <- tesseract("eng")

  for (i in length(file_list) ) {
    ImageDate <- image_read(paste(folder,"\\",file_list, sep = ""))%>% 
  image_resize("2000") %>%
  image_crop("300x100+1800") %>%
  tesseract::ocr(engine = eng) %>%
  strapplyc("\\d+/\\d+/\\d+", simplify = TRUE)%>%
      image_dates[,i]
  }
}

folder <- "C:/file_path"

x <- get_jpeg_date(folder = folder)

The code in the loop works on single files but there is no output when I run the function on a small test sample of 3 jpg images.


Solution

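  • Consider refactoring your function to process a single jpg file, then build the date column by applying it to every path with sapply or map. In R, a function returns the value of its last evaluated expression. In your version the last expression is the for loop, which returns NULL invisibly, so the function appears to produce no output. Note also that for (i in length(file_list)) runs only once, with i set to the last index; seq_along(file_list) would iterate over every file. With the loop removed, the function below returns the OCR'ed and regex-matched string vector directly.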

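    library(tesseract)
    library(magick)
    library(tidyverse)
    library(gsubfn)
    
    # Create the OCR engine once instead of on every call
    eng <- tesseract("eng")
    
    # OCR a single jpg and return the date string(s) found in the cropped region
    get_jpeg_date <- function(pic) {
        image_read(pic) %>% 
            image_resize("2000") %>%
            image_crop("300x100+1800") %>%
            tesseract::ocr(engine = eng) %>%
            strapplyc("\\d+/\\d+/\\d+", simplify = TRUE)
    }
    
    # pattern is a regular expression, not a glob
    file_list <- list.files(path = folder, pattern = "\\.jpg$", full.names = TRUE, recursive = TRUE)
    
    # DATA FRAME BUILD (keep paths as character rather than factor on R < 4.0)
    image_dates_df <- data.frame(img_path = file_list, stringsAsFactors = FALSE)
    # COLUMN ASSIGNMENT
    image_dates_df$img_date <- sapply(image_dates_df$img_path, get_jpeg_date)
    
    # ALTERNATIVELY WITH dplyr::mutate() and purrr::map() (img_date becomes a list-column)
    image_dates_df <- data.frame(img_path = file_list, stringsAsFactors = FALSE) %>%
        mutate(img_date = map(img_path, get_jpeg_date))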
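
With 10,000 images, some will probably yield no date match, several matches, or an OCR error, and in those cases sapply may hand back a list rather than a plain character vector. A minimal sketch of one way to guard against that follows. It is not part of the original answer: the helper names first_date and collect_dates, the ocr_fun argument, and the day/month/year format are all assumptions made for illustration, so adjust them to your images.

```r
# Pull the first d/m/y-style date out of a block of OCR text.
# Returns NA_character_ when nothing matches.
first_date <- function(text) {
  hits <- regmatches(text, regexpr("\\d+/\\d+/\\d+", text))
  if (length(hits) == 0) NA_character_ else hits[[1]]
}

# Apply an OCR function to every file and return a data frame.
# ocr_fun (hypothetical argument) is any function that takes a file
# path and returns text; a failure on one file yields NA instead of
# stopping the whole run.
collect_dates <- function(files, ocr_fun) {
  raw <- vapply(files, function(f) {
    tryCatch(first_date(ocr_fun(f)), error = function(e) NA_character_)
  }, character(1), USE.NAMES = FALSE)

  data.frame(
    img_path = files,
    img_date = as.Date(raw, format = "%d/%m/%Y"),  # assumed day/month/year
    stringsAsFactors = FALSE
  )
}

# Stand-in for the real OCR step so the sketch runs without any images.
fake_ocr <- function(path) {
  if (grepl("bad", path)) stop("unreadable image")
  if (grepl("blank", path)) return("no date here")
  "Taken 12/03/2019 at site A"
}

res <- collect_dates(c("a.jpg", "blank.jpg", "bad.jpg"), fake_ocr)
print(res)
```

To use it with the pipeline above, ocr_fun could be the same chain up to and including tesseract::ocr(), for example function(p) image_read(p) %>% image_resize("2000") %>% image_crop("300x100+1800") %>% tesseract::ocr(engine = eng). vapply is used instead of sapply so that the result is always a character vector of the same length as the file list.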