Why is pytesseract.image_to_string not preserving interword spaces?


Using Tesseract

PS C:\Program Files\Tesseract-OCR> .\tesseract --version
tesseract v5.3.0.20221222
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found libarchive 3.5.0 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5 libzstd/1.4.5
 Found libcurl/7.77.0-DEV Schannel zlib/1.2.11 zstd/1.4.5 libidn2/2.0.4 nghttp2/1.31.0

I have tested Tesseract successfully on the command line:

PS C:\Program Files\Tesseract-OCR> .\tesseract C:\ocr\target\31832_226140__0001-00002b.jpg C:\ocr\results\31832_226140__0001-00002bb6523dpi300fullest --dpi 300 --psm 6 -c preserve_interword_spaces=1 -c tessedit_char_whitelist='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789. '

Partial output

269 Wellington Road Wainumomats Marned       101 ARNOLD. Frank Witham ...............................15 Rossiter Avenue.Lower Hutt. Butcher
002 ANKER. Doreen Akson .............................4 Bledisioe Crescent. Wamuiomata. Teacher       102 ARONA. Amosa ...............0000...........3 Donnelley Drve.Wasnuiomata.Pub. Servant
004 ANKER. Robert James ..........................269 Wellington Road.WainuiomataBank Off       104 ARPS. Velde Lucia ................ ..........53 Westminster Road Wamnuomata Resch Intvr
005 ANNESLEV. Boyne Evan .............................. 13 Manurewa GroveWainwomata Clerk       105 ARPS. Wilkem David ..........................53 Westmnster Road. Waimuomata.Foreman
006 ANNESLEY. Janet Maree ....................13 Manurewa Grove Wainuomats Housewite       106 ARROWSMITH. Margaret Bessie .... ... . 4 Isabel Grove. Wainuiomata. Mamed
007 ANSELL. Anme Ena Elizabeth .........................3 Lewghton Av. Lower Hutt. Homemaker       107 ARROWSMITH. Morns Anthony ................ . 4 Isabel Gr Wamuomata Fetry Magr
O08 ANGELL. Eb se by oe ceeseceereeess 76 Bell Road. Lower Hutt. Housewrfe

I need to process hundreds of files so I downloaded and installed pytesseract.

Successfully installed pytesseract-0.3.10

I upgraded pip

Successfully installed pip-23.0.1

I have run tox

PS C:\Program Files\Tesseract-OCR> tox
ROOT: No tox.ini or setup.cfg or pyproject.toml found, assuming empty tox.ini at C:\Program Files\Tesseract-OCR
  py: OK (4.34 seconds)
  congratulations :) (4.67 seconds)

However, when I run the following Python script, which points to the same tesseract.exe, interword spacing is not preserved.

import pytesseract

# Point pytesseract at the same executable used on the command line.
# A raw string needs single backslashes.
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
image = r'C:\ocr\target\31832_226140__0001-00002b.jpg'
# print() returns None, so its result is not assigned to a variable.
print(pytesseract.image_to_string(image, config='--dpi 96 --psm 6 -c preserve_interword_spaces=1 -c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789. '))

Partial output

.269WellngtonRoadWainumomatsMarned       101ARNOLD.FrankWitham...............................15RossiterAvenue.LowerHutt.Butcher       
002ANKER.DoreenAkson.............................4BledisioeCrescent.Wamuiomata.Teacher       102ARONA.Amosa...............0000...........3DonnelleyDrve.Wasnuiomata.Pub.Servant      
004ANKER.RobertJames..........................269WellingtonRoad.WainuiomataBankOff       104ARPS.ValdaLucis..........................53WestminsterRoadWamnuomataReschIntvr
005ANNESLEV.BoyneEvan..............................13ManurewaGroveWainwomataClerk       105ARPS.WilkemDavid..........................53WestmnsterRoad.Waimuomata.Foreman
006ANNESLEY.JanotMaree....................13ManurewaGroveWainuomatsHousewite       106ARROWSMITH.MargaretBessie........4IsabelGrove.Wainuiomata.Mamed
007ANSELL.AnmeEnaElizabeth.........................3LewghtonAv.LowerHutt.Homemaker       107ARROWSMITH.MornsAnthony.................4IsabelGrWamuomata.FetryMagr
O008ANMGELL.Ebsebyyceeseceereeess76BellRoad.LowerHutt.Housewrfe       108ARTHUR.BruceJames....................65MoohanStreet.WainuomataApp.Mouider 

Can anyone see why this pytesseract image_to_string call is not honoring the config parameter preserve_interword_spaces=1 the way the tesseract command-line example does?


Solution

  • The answer is to make sure that you are NOT omitting the space character from the whitelist. Omitting it effectively removes spaces from the output, which makes it look as if the preserve_interword_spaces=1 parameter is not working.

    For reference, the correct command should have been:

    print(pytesseract.image_to_string(image, config='--dpi 96 --psm 6 -c preserve_interword_spaces=1 -c tessedit_char_whitelist="abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789. "'))
    

    The use of single and double quotes is important. The single quotes surround the complete config string; the double quotes surround the literal whitelist, so that its trailing space is kept.

    It would seem from this that the whitelist takes precedence over the preserve_interword_spaces parameter, which may be redundant if the whitelist already includes a space.
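To see why the quoting matters, here is a minimal sketch using only Python's standard shlex module. It assumes the config string is split into arguments with shell-like rules before being handed to tesseract; that is my understanding of what pytesseract does, but the exact splitting may differ by platform. The shortened whitelist "abc. " and the helper name whitelist_value are made up for illustration.

```python
import shlex

# Shortened, hypothetical whitelist for readability; the trailing space is the point.
UNQUOTED = "--psm 6 -c preserve_interword_spaces=1 -c tessedit_char_whitelist=abc. "
QUOTED = '--psm 6 -c preserve_interword_spaces=1 -c tessedit_char_whitelist="abc. "'


def whitelist_value(config):
    """Return the tessedit_char_whitelist value after shell-style splitting."""
    for token in shlex.split(config):
        key, sep, value = token.partition("=")
        if key == "tessedit_char_whitelist" and sep:
            return value
    return None


print(repr(whitelist_value(UNQUOTED)))  # 'abc.'  (trailing space dropped)
print(repr(whitelist_value(QUOTED)))    # 'abc. ' (trailing space kept)
```

Without the inner double quotes, the trailing space is treated as an argument separator and never reaches the whitelist, so the space character is not among the permitted characters. Printing the split tokens like this is a quick way to check a config string before running it over hundreds of files.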