r ocr tesseract

Can I extract certain words from this image using tesseract ocr package in R?


I have tried using the tesseract OCR package in R to extract text from a PNG image (below):

png image

The text is mostly in Spanish. Here is my code:

library(tesseract)
# tesseract_download("spa")  # download the Spanish training data if you haven't already
spanish <- tesseract("spa")
path_string <- "factura.png"
text <- ocr(path_string, engine = spanish)
cat(text)

But the result is disappointing.

ném…c……
…r …
nw£ccwm … m…… u
mmm …"
pz… u—=,:4| nm;
mmmnzvgm 3134
NUM“ vmnscwm
cuaw ……er
nmcmvcn4 c…r vum
£m|unmusnm . u7m
¡…una
suma… ……
ncm u|s
m:s .
mm u7m
cmmo 1240
nmrAm au…va m m
m.
515 mu .…
…
=mmnzmo
a… rn¿a> rc.¿… ……
u7m
Rm mmm… swmks
…… mmm
m…—
Guuumwsucmm

Is this poor result due to low dpi? Would it be possible to improve this by tinkering with the pre-processing?

For each of these receipts, what I really need is just to pull out the line item with the word "equilibrio" and the value to the right of that (41,760 in this case). Can tesseract be told to focus only on certain words and to also pull out numbers?


Solution

  • I was able to get a readable result with the following code:

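    library(tesseract)
    library(magick)
    
    path_PDF <- "D:\\ymavy.pdf"
    path_PNG <- "D:\\ymavy.png"
    
    # Open a PDF graphics device with a large page (size in inches)
    pdf(path_PDF, height = 12, width = 20)
    
    # Draw the PNG onto the PDF page, then close the device to write the file
    im <- image_read(path_PNG)
    plot(im)
    dev.off()
    
    # Run OCR on the PDF and split the result into lines
    text <- ocr(path_PDF)
    strsplit(text, "\n")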
    
    [[1]]
     [1] "Réginen Comun"                   "NIT: $000808S5S-4"               "DIRECCION: Cra 68d#22b-71 L4"    "TELEFONO 4059003-3023099514"    
     [5] "FECHA 18-aqg0.-78 13:03:29"      "FACTURADE VENTA 312364"          "NOMBRE VENTAS CONTADO"           "CAJERO: ADMINISTRADOR"          
     [9] "DESCRIPCION CANT VALOR"          "EQUILIBRIOGATO j 41.760"         "FILHOTES"                        "SUBTOTAL 46,400"                
    [13] "DCTO. 4479"                      "ITEMS 1"                         "TOTAL 47,7650"                   "CAMBIO 8,240"                   
    [17] "TARIFAIVA BASE GRAVABLE VR.IVA"  "o% 0 0"                          "5% 39,771 1,888"                 "19% ) 0"                        
    [21] "-ORMAS DE PAGO"                  "“fectwo T.Débao T.Crédao Cuenta" "47,760 0 0 0"                    "RESOL. FACTURACION SISTEMA POS" 
    [25] "I8762005857222 2017-11-27"       "20086 al 1000000"                "GRACIAS POR SU COMPRA"          
    
    

    Basically, you draw the image onto a larger PDF page and then run the OCR on that PDF instead of on the original PNG.
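
As for pulling out only the "equilibrio" line and its value: one way to get there is to filter the OCR lines afterwards in plain R. The following is only a minimal sketch. `extract_item_value` is a hypothetical helper name, the sample lines are copied from the output shown earlier in this answer, and it assumes the amounts are whole numbers, so that both "." and "," (which the OCR seems to mix up) can be treated as thousands separators.

```r
# A few OCR lines copied from the strsplit() output in this answer
ocr_lines <- c(
  "DESCRIPCION CANT VALOR",
  "EQUILIBRIOGATO j 41.760",
  "FILHOTES",
  "SUBTOTAL 46,400"
)

# Hypothetical helper: return the last number found on the first line
# that contains `keyword` (case-insensitive), or NA if there is none.
extract_item_value <- function(lines, keyword = "equilibrio") {
  hit <- grep(keyword, lines, ignore.case = TRUE, value = TRUE)
  if (length(hit) == 0) return(NA_character_)
  # Runs of digits, optionally grouped with "." or ","
  pattern <- "[0-9]+(?:[.,][0-9]+)*"
  nums <- regmatches(hit[1], gregexpr(pattern, hit[1], perl = TRUE))[[1]]
  if (length(nums) == 0) return(NA_character_)
  nums[length(nums)]
}

value_raw <- extract_item_value(ocr_lines)

# Assumption: amounts are whole numbers, so "." and "," are both
# thousands separators and can be stripped before converting.
value <- as.numeric(gsub("[.,]", "", value_raw))
```

With real receipts you would pass `strsplit(text, "\n")[[1]]` instead of the hard-coded `ocr_lines`. Since the OCR output is not perfectly reliable (the total above came out as "47,7650"), it is probably worth sanity-checking the extracted values.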