I am trying to scan text from an image with OCR and clean it. I get a character vector in which the sentences are split across several lines, but I would like the text laid out the way it appears in the image.
The code:
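heraclitus <- "greek.png"
library(tidyverse)
library(tesseract)
library(magick)
image_greek <- image_read(heraclitus)
# Scale, crop, and clean up the image before running OCR
image_greek <- image_greek %>%
  image_scale("600") %>%
  image_crop("600x400+220+150") %>%
  image_convert(type = 'Grayscale') %>%
  image_contrast(sharpen = 1) %>%
  image_write(format = "jpg")  # no path given, so this returns the image as a raw vector
# Read the raw vector back in, run OCR, and split the text at every newline
heraclitus_sentences <- magick::image_read(image_greek) %>%
  ocr() %>%
  str_split("\n")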
As you can see from the output, there are empty strings, and some sentences are split across two lines. I would like a vector or a list in which each element is one sentence.
You need to split on \n\n (not \n), then replace the middle \n values:
magick::image_read(image_greek) %>%
  ocr() %>%
  str_split("\n\n") %>%
  unlist() %>%
  str_replace_all("\n", " ")
Output:
[1] "© Much learning does not teach understanding."
[2] "© The road up and the road down is one and the same."
[3] "© Our envy always lasts longer than the happiness of those we envy."
[4] "© No man ever steps in the same river twice, for it's not the same river and he's not the same man. "
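If you also want to drop the leading © that the OCR produced for each bullet, along with the trailing whitespace, something like the following might work. This is only a sketch: `ocr_text` is a made-up stand-in for what `ocr()` returns (so it runs without the image), and it uses base R functions rather than stringr so that it needs no extra packages.

```r
# Hypothetical stand-in for the string returned by ocr(); not real OCR output.
# "\u00a9" is the copyright sign that appeared in place of each bullet.
ocr_text <- paste0(
  "\u00a9 Much learning does not\nteach understanding.\n\n",
  "\u00a9 The road up and the road down\nis one and the same.\n"
)

# Split on blank lines so that each element holds one sentence.
sentences <- unlist(strsplit(ocr_text, "\n\n", fixed = TRUE))

# Join the wrapped lines inside each sentence with a space.
sentences <- gsub("\n", " ", sentences, fixed = TRUE)

# Remove the leading bullet glyph, then trim leftover whitespace.
sentences <- trimws(sub("^\u00a9\\s*", "", sentences))

print(sentences)
```

The same cleanup step could be appended to the pipe above; whether the bullet always comes out as © depends on the image and the OCR engine, so check the pattern against your own output.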