I have a large number (approx. 10,000) of jpg files, each with a date written on it. I want to extract the date from each jpg and add it to a dataframe alongside the corresponding filename.
I have read this forum and beyond and I have tried to patch together a function in R which will perform the task but I cannot get it to work. I have used a loop to:
1) generate a list of image files in the chosen directory
2) create a dataframe for the results with a column for file path and a column for date (extracted from the jpg)
3) loop through files in directory: Resize, Crop to portion of image showing date, OCR the image, Write date to dataframe - created in step 2
This seems to crash when I run the function and I am not really sure why. I am an R user but I have not written functions before (you can probably tell).
I am using R 3.6.0 and RStudio.
library(tesseract)
library(magick)
library(tidyverse)
library(gsubfn)
get_jpeg_date <- function(folder) {
  file_list <- list.files(path=folder, pattern="*.jpg", recursive = T)
  image_dates <- as.data.frame(file_list)
  image_dates $ ImageDate <- rep_len(x = NA, length.out = length(file_list))
  eng <- tesseract("eng")
  for (i in length(file_list) ) {
    ImageDate <- image_read(paste(folder,"\\",file_list, sep = ""))%>%
      image_resize("2000") %>%
      image_crop("300x100+1800") %>%
      tesseract::ocr(engine = eng) %>%
      strapplyc("\\d+/\\d+/\\d+", simplify = TRUE)%>%
      image_dates[,i]
  }
}
folder <- "C:/file_path"
x <- get_jpeg_date(folder = folder)
The code in the loop works on single files but there is no output when I run the function on a small test sample of 3 jpg images.
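Consider refactoring your function to run on a single jpg file, then build the date column by applying it to every path with sapply or map. In R, a function returns the value of its last evaluated expression. Your original function ends with a for loop, which evaluates to NULL, so the call returns nothing. In the function below, the pipeline is the last expression instead of a for loop, so it returns the OCR'ed and regex-ed string vector.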
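To make the return-value point concrete, here is a minimal base R sketch. The functions loop_last and value_last are hypothetical names used only for illustration; they stand in for the OCR step with toupper so the example runs without any image files. It also shows that looping over length(x) rather than seq_along(x) visits only one index.

```r
# A function whose last expression is a for loop returns NULL
loop_last <- function(x) {
  out <- character(length(x))
  for (i in seq_along(x)) {
    out[i] <- toupper(x[i])
  }
}

# Making the result the last expression returns it to the caller
value_last <- function(x) {
  out <- character(length(x))
  for (i in seq_along(x)) {
    out[i] <- toupper(x[i])
  }
  out
}

files <- c("a.jpg", "b.jpg", "c.jpg")
is.null(loop_last(files))  # TRUE: the loop's value is NULL
value_last(files)          # the filled character vector

# length(files) is the single number 3, so this body runs once, with i = 3
visited <- integer(0)
for (i in length(files)) visited <- c(visited, i)
visited
```

In the original function both effects combine: only the last index would be visited, and whatever was computed inside the loop is never returned.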
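# Takes the path to ONE image and returns the date string(s) found in it
get_jpeg_date <- function(pic) {
  eng <- tesseract("eng")
  image_read(pic) %>%
    image_resize("2000") %>%                       # scale to 2000 px wide
    image_crop("300x100+1800") %>%                 # keep only the date region
    tesseract::ocr(engine = eng) %>%               # image -> text
    strapplyc("\\d+/\\d+/\\d+", simplify = TRUE)   # text -> date pattern
}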
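# pattern is a regular expression, not a glob: "\\.jpg$" matches names ending in .jpg
file_list <- list.files(path = folder, pattern = "\\.jpg$", full.names = TRUE, recursive = TRUE)
# DATA FRAME BUILD
# stringsAsFactors = FALSE keeps the paths as character (R 3.6.0 defaults to factors)
image_dates_df <- data.frame(img_path = file_list, stringsAsFactors = FALSE)
# COLUMN ASSIGNMENT
image_dates_df$img_date <- sapply(image_dates_df$img_path, get_jpeg_date)
# ALTERNATIVELY WITH dplyr::mutate() and purrr::map()
# (map returns a list column, one element per image)
image_dates_df <- data.frame(img_path = file_list, stringsAsFactors = FALSE) %>%
  mutate(img_date = map(img_path, get_jpeg_date))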
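With roughly 10,000 images, some OCR results will probably contain no date or more than one match, which can leave the date column as a ragged list. One possible way to guarantee exactly one value per image is sketched below in base R. The helper first_date is a hypothetical name, not part of any package, and the ocr_text strings are made-up stand-ins for what tesseract::ocr might return.

```r
# Return the first dd/mm/yyyy-like match in a string, or NA if there is none
first_date <- function(txt) {
  m <- regmatches(txt, regexpr("\\d+/\\d+/\\d+", txt))
  if (length(m) == 0) NA_character_ else m[[1]]
}

# Made-up OCR output: one date, no date, two dates
ocr_text <- c("Taken 12/05/2019 10:31\n",
              "no date here",
              "01/01/2018 and 02/01/2018")

# vapply enforces one character value per input, so the result is a plain vector
dates <- vapply(ocr_text, first_date, character(1), USE.NAMES = FALSE)
dates
```

If this fits your images, first_date could take the place of the strapplyc step inside get_jpeg_date, and vapply could replace sapply in the column assignment, so img_date is always a character column with NA where no date was read.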