r image ocr data-cleaning

Clean OCR text from an image in R


I am trying to extract text from an image with OCR and clean it. The result is a character string that is broken across several lines, but I would like the text laid out the way it appears in the image.

The code:

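library(tidyverse)
library(tesseract)
library(magick)

heraclitus <- "greek.png"

# Read the image, then scale, crop, convert to grayscale and sharpen it
image_greek <- image_read(heraclitus)

image_greek <- image_greek %>%
  image_scale("600") %>%
  image_crop("600x400+220+150") %>%
  image_convert(type = 'Grayscale') %>%
  image_contrast(sharpen = 1) %>%
  image_write(format = "jpg")  # no path given, so the JPEG is returned as a raw vector

# Run OCR on the preprocessed image and split the text at every line break
heraclitus_sentences <- magick::image_read(image_greek) %>%
  ocr() %>%
  str_split("\n")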

As the output shows, I get blank entries and sentences that are split across two lines. I would like a vector or a list in which each element is one sentence.



Solution

  • Split on \n\n (the blank line between sentences) instead of \n, then replace each remaining \n inside a sentence with a space:

    magick::image_read(image_greek) %>% 
      ocr() %>% 
      str_split("\n\n") %>%
      unlist() %>%
      str_replace_all("\n", " ")
    

    Output:

    [1] "© Much learning does not teach understanding."                                                       
    [2] "© The road up and the road down is one and the same."                                                
    [3] "© Our envy always lasts longer than the happiness of those we envy."                                 
    [4] "© No man ever steps in the same river twice, for it's not the same river and he's not the same man. "
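The same idea can be tried without an image or tesseract installed. The sketch below is only an illustration in base R (`strsplit`, `gsub`, `trimws` in place of the stringr calls above). The function name `split_ocr_sentences` and the sample string are hypothetical, and the sample assumes the OCR text separates sentences with a blank line, as in the output shown. Dropping the leading © (written as `\u00a9`) and the trailing space are optional extras, not part of the answer above.

```r
# Hypothetical helper: turn raw OCR text into one clean sentence per element.
split_ocr_sentences <- function(ocr_text) {
  # 1. Split where a blank line separates two sentences
  parts <- strsplit(ocr_text, "\n\n", fixed = TRUE)[[1]]
  # 2. Join lines that were wrapped inside a single sentence
  parts <- gsub("\n", " ", parts, fixed = TRUE)
  # 3. Remove leading and trailing whitespace
  parts <- trimws(parts)
  # 4. Optional: drop a leading (c) sign that OCR produced for the bullet
  parts <- sub("^\u00a9\\s*", "", parts)
  # 5. Drop any empty elements left over
  parts[nzchar(parts)]
}

# Made-up sample imitating the OCR output; \u00a9 is the (c) sign
sample_text <- paste0(
  "\u00a9 Much learning does not teach understanding.\n\n",
  "\u00a9 The road up and the road down is one and the same.\n\n",
  "\u00a9 No man ever steps in the same river twice, for it's not\n",
  "the same river and he's not the same man.\n"
)

sentences <- split_ocr_sentences(sample_text)
print(sentences)
```

If the real OCR text contains more than one blank line between sentences, step 5 discards the empty pieces that the split would otherwise leave behind.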