
Why does a numpy array appear to have no shape?


I understand the following:

import numpy as np

arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

print(arr.shape)

Output:

(2, 4)

So I was wondering why I get the following:

import numpy

import pytesseract
import logging

# A raw string literal avoids having to escape the backslashes in a Windows path
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract'

logging.basicConfig(level=logging.WARNING)
logging.getLogger('pytesseract').setLevel(logging.DEBUG)


image = r'C:\ocr\target\31832_226140__0001-00002b.jpg'
target = numpy.asarray(pytesseract.image_to_string(image, config='--dpi 96 --psm 6 -c preserve_interword_spaces=1 -c tessedit_char_whitelist="abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.,- \'" '))
print("target type is:",type(target))
print("target array shape is:",target.shape)

Output:

DEBUG:pytesseract:['C:\\Program Files\\Tesseract-OCR\\tesseract', 'C:\\ocr\\target\\31832_226140__0001-00002b.jpg', 'C:\\Users\\david\\AppData\\Local\\Temp\\tess_p68ogbz9', '--dpi', '96', '--psm', '6', '-c', 'preserve_interword_spaces=1', '-c', "tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.,- '", 'txt']
target type is: <class 'numpy.ndarray'>
target array shape is: ()

Okay, my array holds text. But I still would have expected to get something like (1, 999) for its shape?

Using the line print(target) gives the following type of output.

-------->snip<----------

196 ANGUS, Lynne Manon ........................128 Wellington Rd, Wemuomata Recepnonst
        197 ANGUS, Mane Joan .........00... ......129 Wellington Road, Weinumomata, Married
       198 ANGUS, Manon Jean .........................173 Wellington Road, Weinuiomata,Texi Driver
        199 ANGUS. Noel Fulton ........................127 Weinuomats Road, Weinuomate, Carpenter
   

Solution

  • This just means that you've created a zero-dimensional array (a scalar). It does have a shape: the empty tuple (). Consider:

    >>> import numpy as np
    >>> arr = np.array(1)
    >>> arr
    array(1)
    >>> arr.shape
    ()
    

    This is because pytesseract.image_to_string returns a single str object by default, and numpy wraps a single string as one scalar element, however long the text is. So of course, you get:

    >>> np.asarray("some string object")
    array('some string object', dtype='<U18')
    >>> np.asarray("some string object").shape
    ()
    

    It isn't clear exactly what you expect to create. As you stated, you just have text, so why are you trying to create a numpy.ndarray object out of it? If you can elaborate on what you are trying to achieve, perhaps I or others can suggest an approach.
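As a rough sketch of what is going on, and of one possible way to get an array with a non-empty shape, the snippet below first inspects the zero-dimensional array and then splits the text before handing it to numpy. The `text` value is a made-up stand-in for the string that the OCR call returns, and splitting by line (or by character) is only an assumption about what you might want.

```python
import numpy as np

# Hypothetical stand-in for the str returned by pytesseract.image_to_string
text = "196 ANGUS, Lynne\n197 ANGUS, Mane\n198 ANGUS, Manon\n"

# Wrapping one string gives a zero-dimensional array holding one element
scalar = np.asarray(text)
print(scalar.ndim)            # 0
print(scalar.shape)           # ()
print(scalar.item() == text)  # True: .item() gives back the Python str

# Splitting first gives numpy a sequence, so the result is one-dimensional
lines = np.array(text.splitlines())
print(lines.shape)            # (3,)

# One element per character is also possible
chars = np.array(list(text))
print(chars.shape == (len(text),))  # True
```

Whether an array of lines is actually useful depends on what you plan to do next; for plain text processing a list of strings from `text.splitlines()` is usually enough on its own.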