I am using Python 3.7.4 in VS Code. I created a function img_to_text() that takes the name of a PDF file as its argument. The function saves the first page of the PDF as a JPEG and uses pytesseract.image_to_string() to read the text from that image. The text is then searched for a set of names, and the names that appear in it are collected in the list main_consultant_name.
Because the sequential run time was considerably high, I used multiprocessing, which reduced it from 34 minutes to 2 minutes for 258 PDFs.
import pytesseract
from pdf2image import convert_from_path
from PIL import Image

def img_to_text(file):
    main_consultant_name = []  # local to this function
    pytesseract.pytesseract.tesseract_cmd = r'C:\Users\.....\......\Tesseract-OCR\tesseract.exe'
    # Render only the first page of the PDF at 500 DPI
    pages = convert_from_path("C:/pdfs" + file + '.pdf', 500, last_page=1)
    for page in pages:
        filename = file + '_Page1.jpg'
        page.save("C:/Users/................" + filename, 'JPEG')
        text = str(pytesseract.image_to_string(Image.open("C:/Users/............../" + filename))).lower().replace('\n\n', ' ')
        consultant_name = []
        # consultant_name_lst is assumed to be defined elsewhere in the module
        for name in consultant_name_lst:
            if name.lower() in text:
                consultant_name.append(name)
        main_consultant_name.append(consultant_name)
    return main_consultant_name
def process_handler():
    with engine.connect() as conn:
        query1 = "SELECT * FROM pdfs;"
        df1 = pd.read_sql(query1, conn)
    files = [file for file in df1['pdfName']]
    with Pool() as pool:
        results = pool.map(img_to_text, files)
    for result in results:
        print(result)
    df1['consultant_name'] = main_consultant_name # problem is here
I am trying to add a column to the dataframe df1 from the list main_consultant_name, but I get the error NameError: name 'main_consultant_name' is not defined. From some research I gather that, because the list is defined inside the function, it cannot be accessed outside of it. I tried defining the list globally, but that did not work and returned the same error.
Any ideas as to what I am doing wrong here? Thanks a lot!
The explanation lies in the concepts of namespaces and variable scope. Python resolves a name through up to four namespaces: local, enclosing (which exists only for nested functions), global and built-in. In short, variables declared at module level belong to the global namespace, and variables declared inside a function belong to that function's local namespace. Code in the local namespace can read names from the global one, as below:
a = 'Hello'

def testing():
    return a

print(testing())  # Prints 'Hello'
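The enclosing namespace mentioned above only comes into play with nested functions. As a minimal sketch (the names outer, inner and greeting are made up for illustration), an inner function can read a variable that belongs to the function wrapped around it:

```python
def outer():
    greeting = 'Hello'  # lives in outer()'s local namespace

    def inner():
        # 'greeting' is not local to inner(), so Python looks it up
        # in the namespace of the enclosing function outer()
        return greeting

    return inner()

print(outer())  # Prints 'Hello'
```

The lookup always goes outward (local, then enclosing, then global, then built-in), never inward.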
But you can't go the other way and read a local name from the global namespace, which is what your code is trying to do. Using the same example as before:
def testing():
    a = 'Hello'
    return a

print(a)
This raises the error: NameError: name 'a' is not defined
So what you can do is capture what img_to_text returns and assign that to df1['consultant_name']. In your code, pool.map already collects those return values in results. The idea in miniature:
def testing():
    a = 'Hello'
    return a

result = testing()
print(result)  # Prints 'Hello'
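Applied to the question's code, the same idea might look roughly like the sketch below. It is only an illustration under stated assumptions: fake_img_to_text is a hypothetical stand-in for img_to_text (no OCR involved), the file names are invented, and the built-in map() stands in for pool.map(), which likewise returns results in the same order as its inputs.

```python
def fake_img_to_text(file):
    # Hypothetical stand-in for img_to_text(): returns a list holding
    # one list of matched names per processed page (first page only)
    known = {'a': ['Alice'], 'b': []}
    return [known.get(file, [])]

files = ['a', 'b']  # invented file names

# map() stands in for pool.map(); both keep the input order
results = list(map(fake_img_to_text, files))

# Each result wraps a single inner list, so unwrap it to get
# one list of names per file
consultant_names = [pages[0] if pages else [] for pages in results]
print(consultant_names)  # [['Alice'], []]
```

Since the resulting list has one entry per file, in the same order as df1['pdfName'], it could then be assigned inside process_handler as df1['consultant_name'] = consultant_names.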
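Or you can use global as shown below, although it's not recommended, and it will not help with multiprocessing because each worker process modifies its own copy of the module's globals: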
a = ''

def testing():
    global a
    a = 'Hello'
    return a

result = testing()
print(result)
Hope this can help you :)