I just need to be sure about the performance: currently I am working with functions that use return, and it takes too much time to display the whole result. Following is an approach using yield:
    import os
    from io import StringIO

    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfpage import PDFPage

    dirpath = "E:\\Python_Resumes\\"

    def getResumeList(dirpath):
        files = os.listdir(dirpath)
        for file in files:
            if file.endswith(".pdf"):
                yield file

    fileObject = getResumeList(dirpath)

    def convertToRawText(fileObject):
        for file in fileObject:
            fContent = open(dirpath + file, 'rb')
            rsrcmgr = PDFResourceManager()
            sio = StringIO()
            codec = 'utf-8'
            laparams = LAParams()
            device = TextConverter(rsrcmgr, sio, codec=codec, laparams=laparams)
            interpreter = PDFPageInterpreter(rsrcmgr, device)
            for page in PDFPage.get_pages(fContent):
                interpreter.process_page(page)
            rawText = sio.getvalue()
            yield rawText

    result = convertToRawText(fileObject)
    for r in result:
        print(r)
        print("\n")
And following is an approach using return:
    def getResumeList(dirpath):
        resumes = []
        files = os.listdir(dirpath)  # get all the files in that directory
        for file in files:
            if file.endswith(".pdf"):
                resumes.append(file)
        return resumes

    listOfFiles = getResumeList(dirpath)

    def convertToRawText(files):
        resumeContent = {}
        for file in files:
            fContent = open(dirpath + file, 'rb')
            rsrcmgr = PDFResourceManager()
            sio = StringIO()
            codec = 'utf-8'
            laparams = LAParams()
            device = TextConverter(rsrcmgr, sio, codec=codec, laparams=laparams)
            interpreter = PDFPageInterpreter(rsrcmgr, device)
            for page in PDFPage.get_pages(fContent):
                interpreter.process_page(page)
            rawText = sio.getvalue()
            resumeContent[file] = rawText
        return resumeContent

    bulkResumesText = convertToRawText(list(listOfFiles))
    for b in bulkResumesText:
        print(bulkResumesText[b])
Which one would be better from a performance and efficiency point of view?
First things first: I highly recommend writing clean code. That means when you write Python, don't write C#/Java; follow PEP 8.

Another issue: try to be Pythonic (sometimes it even makes your code faster). For example, instead of your getResumeList() in the generator example, try a generator expression:
    def get_resume_list(dir_path):
        files = os.listdir(dir_path)
        return (f for f in files if f.endswith(".pdf"))
Or a list comprehension, in the second example:
    def get_resume_list(dir_path):
        files = os.listdir(dir_path)
        return [f for f in files if f.endswith(".pdf")]
When you open a file, use with, because people tend to forget to close files.
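A minimal sketch of that advice, applied to the file handling in your loop; the function name and the reduced body (just reading bytes) are mine, standing in for the pdfminer pipeline:

    import os

    def read_resume_bytes(dir_path, file_name):
        # `with` closes the file automatically, even if an exception is
        # raised while reading.
        with open(os.path.join(dir_path, file_name), 'rb') as f_content:
            return f_content.read()  # file is closed when the block exits
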
About efficiency: it is clear that generators were created for exactly that. With a generator you can see each result as soon as it is ready, instead of waiting for the whole batch to finish processing.
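A toy illustration of that point; `.upper()` is a stand-in for your slow PDF parsing, and the names here are mine:

    def parse_all_list(items):
        results = []
        for item in items:
            results.append(item.upper())  # stand-in for slow parsing
        return results  # nothing is available until every item is done

    def parse_all_gen(items):
        for item in items:
            yield item.upper()  # each result is handed back immediately

    gen = parse_all_gen(["a.pdf", "b.pdf"])
    first = next(gen)  # returned before "b.pdf" is ever touched
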
About performance: I don't know how many PDF files you are trying to parse, but I ran a little test on 1056 PDF files, and the list version was faster by a couple of seconds (that is usually the case when measuring raw speed). Generators are there for efficiency; look at this answer by Raymond Hettinger (a Python core developer) explaining when not to use generators.
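A rough sketch of how such a timing can be reproduced with the standard library; the workload is a stand-in for PDF parsing, and 1056 only mirrors the file count above:

    import time

    def measure(fn):
        start = time.perf_counter()
        fn()
        return time.perf_counter() - start

    files = ["resume_%d.pdf" % i for i in range(1056)]

    def consume_list():
        for text in [name.upper() for name in files]:  # all work up front
            pass

    def consume_gen():
        for text in (name.upper() for name in files):  # work on demand
            pass

    list_time = measure(consume_list)
    gen_time = measure(consume_gen)
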
In conclusion: in your case it is more efficient to use the generator, and faster to use the list.
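Pulling the advice together, here is a sketch of the generator route; `parse_pdf` is a placeholder for your pdfminer pipeline, and all names are mine:

    import os

    def get_resume_list(dir_path):
        # generator expression: file names are produced lazily
        return (f for f in os.listdir(dir_path) if f.endswith(".pdf"))

    def convert_to_raw_text(dir_path, file_names):
        for name in file_names:
            # `with` guarantees the file handle is closed
            with open(os.path.join(dir_path, name), 'rb') as f:
                yield name, parse_pdf(f)

    def parse_pdf(file_obj):
        # placeholder: real code would run pdfminer's TextConverter here
        return file_obj.read().decode('latin-1')
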