python-3.x, performance, text-parsing

Should I use a generator or a function with return for resume parsing in Python, where I need to process lots of resumes at a time?


I just need to be sure about the performance: currently I am working with functions that return, and it takes too much time to display the whole result. Following is an approach using yield:

dirpath = "E:\\Python_Resumes\\"

def getResumeList(dirpath):
    files = os.listdir(dirpath)
    for file in files:
        if file.endswith(".pdf"):
            yield file

fileObject=getResumeList(dirpath)

def convertToRawText(fileObject):
    rawText = ""
    resumeContent = {}
    for file in fileObject:
        fContent = open(dirpath + file, 'rb')
        rsrcmgr = PDFResourceManager()
        sio = StringIO()
        codec = 'utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, sio, codec=codec, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.get_pages(fContent):
            interpreter.process_page(page)
            rawText = sio.getvalue()
            yield rawText


result=convertToRawText(fileObject)

for r in result:
   print(r)
   print("\n")

And following is an approach using return:

def getResumeList(dirpath):
    resumes = []
    files = os.listdir(dirpath)  # Get all the files in that directory
    for file in files:
        if file.endswith(".pdf"):
            resumes.append(file)
    return resumes

listOfFiles=getResumeList(dirpath)

def convertToRawText(files):
    rawText = ""
    resumeContent = {}
    for file in files:
        fContent = open(dirpath + file, 'rb')
        rsrcmgr = PDFResourceManager()
        sio = StringIO()
        codec = 'utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, sio, codec=codec, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.get_pages(fContent):
            interpreter.process_page(page)
            rawText = sio.getvalue()
        resumeContent[file] = rawText
    return resumeContent

bulkResumesText = convertToRawText(listOfFiles)

for b in bulkResumesText:
    print(bulkResumesText[b])

Which would be the better one from a performance and efficiency point of view?


Solution

  • First things first: I highly recommend writing clean code; that means when you write Python, don't write C#/Java (see PEP 8).

    Another issue: try to be Pythonic (sometimes it even makes your code faster). For example, instead of your getResumeList() in the generator example, try a generator expression:

    def get_resume_list(dir_path):
        files = os.listdir(dir_path)
        return (f for f in files if f.endswith(".pdf"))
    

    Or a list comprehension, for the second example:

    def get_resume_list(dir_path):
        files = os.listdir(dir_path)
        return [f for f in files if f.endswith(".pdf")]
    

    When you open a file, try to use with, because people tend to forget to close files.
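A sketch of that pattern (generic file handling only; the pdfminer-specific calls from the question would go inside the with block):

```python
def read_resume_bytes(path):
    # "with" guarantees the handle is closed, even if parsing
    # raises an exception halfway through the file
    with open(path, 'rb') as f_content:
        # ...PDFResourceManager / TextConverter work goes here...
        return f_content.read()
```

The same idea applies to the StringIO buffer and the TextConverter device, which both have close() methods worth calling when you are done with them.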

    About efficiency: it is clear that generators were created for exactly this. With a generator you see each result as soon as it is ready, instead of waiting for the whole batch to finish processing.
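A toy illustration of that laziness, with str.upper() standing in for the PDF conversion:

```python
def parse_all(files):
    # list version: the caller sees nothing until every file is done
    return [name.upper() for name in files]

def parse_lazy(files):
    # generator version: each result is handed over as soon as it exists
    for name in files:
        yield name.upper()

files = ["a.pdf", "b.pdf", "c.pdf"]
first = next(parse_lazy(files))  # -> "A.PDF"; only the first file was processed
```

With the real parsing code, that difference is exactly why the yield version starts printing resumes immediately while the return version is silent until the last file finishes.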

    About performance: I don't know how many PDF files you are trying to parse, but I did a little test on 1056 PDF files, and the list version was faster by a couple of seconds (that is usually the case when you measure raw speed). Generators are there for efficiency; look at this answer by Raymond Hettinger (a Python core developer) explaining when not to use generators.

    In conclusion: in your case it is more memory-efficient to use a generator, and slightly faster to use a list.
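If you want to reproduce that kind of measurement on your own machine, a minimal timeit sketch (with a cheap computation standing in for the PDF parsing) looks like this:

```python
import timeit

N = 10_000

def consume_list():
    return sum([i * i for i in range(N)])   # builds the full list first

def consume_gen():
    return sum(i * i for i in range(N))     # produces items one at a time

list_time = timeit.timeit(consume_list, number=100)
gen_time = timeit.timeit(consume_gen, number=100)
print(f"list: {list_time:.3f}s  gen: {gen_time:.3f}s")
```

For your real code, substitute the two convertToRawText variants (and a much smaller number= value, since parsing PDFs is far slower than this toy loop).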