python function multiprocessing global-variables python-tesseract

How to assign list created inside a function to the data frame column


I am using Python 3.7.4 in VS Code. I wrote a function img_to_text() that takes a PDF file as its argument. It saves the first page of the PDF as a JPEG and uses pytesseract.image_to_string() to read the text from the image. That text is then searched for a set of names, and every name that appears in it is appended to the list main_consultant_name.

Because the whole process was slow, I used multiprocessing, which cut the run time for 258 PDFs from 34 minutes sequentially to 2 minutes.

def img_to_text(file):
    main_consultant_name = []
    pytesseract.pytesseract.tesseract_cmd = r'C:\Users\.....\......\Tesseract-OCR\tesseract.exe'
    pages = convert_from_path("C:/pdfs" + file + '.pdf', 500, last_page= 1)
    for page in pages:
        filename = file +'_Page1.jpg'
        page.save("C:/Users/................" + filename, 'JPEG')
        text = str(((pytesseract.image_to_string(Image.open("C:/Users/............../" + filename))))).lower().replace('\n\n',' ')
        consultant_name = []
        for name in consultant_name_lst:
            if name.lower() in text:
                consultant_name.append(name)
        main_consultant_name.append(consultant_name)
    return main_consultant_name

def process_handler():
    with engine.connect() as conn:
        query1 = "SELECT * FROM pdfs;"
        df1 = pd.read_sql(query1, conn)
    files = [file for file in df1['pdfName']]
    with Pool() as pool:
        results = pool.map(img_to_text, files)
    for result in results:
        print(result)

df1['consultant_name'] = main_consultant_name     # problem is here

I am trying to add a column to the dataframe df1 from the list main_consultant_name, but I get the error NameError: name 'main_consultant_name' is not defined. From some research, I gather that because the list is defined inside the function, it cannot be accessed outside it. I tried defining the list globally, but that did not work and returned the same error.

Any ideas as to what I am doing wrong here? Thanks a lot!


Solution

  • The error comes from namespaces and variable scope. Python resolves a name by looking in the local, enclosing, global, and built-in namespaces, in that order. Names assigned at module level belong to the global namespace, and names assigned inside a function belong to that function's local namespace. Code inside a function can read global names, as below:

    a = 'Hello'
    def testing():
      return a
    
    print(testing()) # Prints 'Hello'
    

    But code outside a function cannot reach that function's local names, which is what your code is trying to do. Using the same example:

    def testing():
      a = 'Hello'
      return a
    
    print(a)
    

    Raises the error: NameError: name 'a' is not defined

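    So capture what the function returns and use that. In your code, pool.map already collects the return value of each img_to_text call in results, so assign results (not main_consultant_name) to df1['consultant_name'] inside process_handler, where df1 exists. The basic pattern is: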

    def testing():
      a = 'Hello'
      return a
    
    result = testing()
    print(result) # Prints 'Hello'
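
Applied to the question's setup, a minimal sketch might look like the following. The OCR step is replaced by a hypothetical lookup table (FAKE_OCR_TEXT) and the names list (CONSULTANT_NAMES) is made up, so the example runs without Tesseract or a database. Only the return-and-collect pattern is the point.

```python
from multiprocessing import Pool

# Hypothetical stand-ins for the question's data; the real code would run OCR.
CONSULTANT_NAMES = ["Alice", "Bob"]
FAKE_OCR_TEXT = {
    "a": "report by alice",
    "b": "alice and bob",
    "c": "no match here",
}


def img_to_text(file):
    """Return the consultant names found in the text for one file."""
    text = FAKE_OCR_TEXT[file].lower()
    return [name for name in CONSULTANT_NAMES if name.lower() in text]


def process_handler(files):
    # pool.map returns one result per input, in the same order as `files`.
    with Pool(2) as pool:
        results = pool.map(img_to_text, files)
    return results


if __name__ == "__main__":
    consultant_names = process_handler(["a", "b", "c"])
    print(consultant_names)  # [['Alice'], ['Alice', 'Bob'], []]
```

Because pool.map preserves input order, the returned list lines up with the file names it was built from, so inside process_handler it could be assigned directly with df1['consultant_name'] = results.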
    

    Alternatively, you can use global, as below, but it is not recommended:

    a = ''
    def testing():
      global a
      a = 'Hello'
      return a
    
    result = testing()
    print(result)
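
With multiprocessing, as in the question, there is a further reason to avoid relying on globals: each worker process has its own copy of module-level variables, so changes made inside a worker do not reach the parent process. A small sketch of that behaviour, using hypothetical names (collected, work):

```python
from multiprocessing import Pool

collected = []  # module-level list; each worker process has its own copy


def work(x):
    collected.append(x)  # changes the list in whichever process runs this call
    return x * 2


if __name__ == "__main__":
    with Pool(2) as pool:
        returned = pool.map(work, [1, 2, 3])
    print(returned)   # [2, 4, 6]
    print(collected)  # [] in the parent process
```

The return values come back through pool.map, while the appends stay in the workers, which is why returning the data is the dependable route here.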
    

    Hope this can help you :)