Tags: python, python-3.x, elasticsearch, nlp, pypdf

Extracting information from multiple resumes all in PDF format


I have a dataset with a column containing Google Drive links to resumes. There are 5000 rows, so 5000 links. I am trying to extract information such as years of experience and salary from these resumes into 2 separate columns. So far I've seen many examples mentioned here on SO.

For example: the code below can only read the data from one file. How do I replicate this across multiple rows?

Please help me with this, else I will have to manually go through 5000 resumes and fill in the data.

I'm hoping to find a solution to this painful problem.

import PyPDF2

pdf_file = open('sample.pdf', 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
page = read_pdf.getPage(0)
page_content = page.extractText()
print(page_content.encode('utf-8'))

# to extract salary and experience using regular expressions
import re

prog = re.compile(r"\s*(Name|name|nick).*")
result = prog.match("Name: Bob Exampleson")

if result:
    print(result.group(0))

result = prog.match("University: MIT")

if result:
    print(result.group(0))

Solution

  • Use a loop. Basically you put your main code into a function (easier to read) and create a list of filenames. Then you iterate over this list, using each value from the list as the argument for your function:

    Note: I didn't check your scraping code, just showing how to loop. There are also far more efficient ways to do this, but I'm assuming you're somewhat of a Python beginner, so let's keep it simple to start with.

    # add your imports to the top
    import re
    import PyPDF2


    # put the main code in a function for readability;
    # define it before the loop that calls it
    def get_data(filename):
        with open(filename, 'rb') as pdf_file:  # closes the file automatically
            read_pdf = PyPDF2.PdfFileReader(pdf_file)
            number_of_pages = read_pdf.getNumPages()
            page = read_pdf.getPage(0)
            page_content = page.extractText()
        print(page_content.encode('utf-8'))

        prog = re.compile(r"\s*(Name|name|nick).*")
        result = prog.match("Name: Bob Exampleson")

        if result:
            print(result.group(0))

        result = prog.match("University: MIT")

        if result:
            print(result.group(0))


    # create a list of your filenames
    files_list = ['a.pdf', 'b.pdf', 'c.pdf']
    for filename in files_list:  # iterate over the list
        get_data(filename)


    So now your next question might be: how do I create this list with 5000 filenames? This depends on what the files are called and where they are stored. If they are named sequentially, you could do something like:

    files_list = []  # empty list
    num_files = 5000  # total number of files
    for i in range(1, num_files+1):
        files_list.append(f'myfile-{i}.pdf')
    

    This will create a list with 'myfile-1.pdf', 'myfile-2.pdf', etc.

    Hopefully this is enough to get you started.
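    If the files don't follow a sequential naming pattern but already sit together in one folder, the standard library's `glob` module can build the list for you. This is a minimal sketch; the throwaway temp folder is only there to make the example self-contained, so point the pattern at your own directory instead:

```python
import glob
import os
import tempfile

# build a throwaway folder with a few empty PDFs just for demonstration
folder = tempfile.mkdtemp()
for name in ('a.pdf', 'b.pdf', 'c.pdf'):
    open(os.path.join(folder, name), 'w').close()

# collect every PDF in the folder, sorted for a stable order
files_list = sorted(glob.glob(os.path.join(folder, '*.pdf')))
print(len(files_list), 'files found')  # -> 3 files found
```

    You'd then feed `files_list` straight into the loop shown above.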

    You can also use return in your function to create a new list with all of the output which you can use later on, instead of printing the output as you go:

    output = []
    
    def doSomething(i):
        return i * 2
    
    for i in range(1, 100):
        output.append(doSomething(i))
    
    # output is now a list with values like:
    # [2, 4, 6, 8, 10, 12, ...]
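    Applied to the resume task, that same pattern could collect one (experience, salary) pair per resume, ready to go into your two columns. The regular expressions below are only guesses at how a resume might phrase these fields; you'd adapt them to your actual data:

```python
import re

# hypothetical patterns -- adjust them to how your resumes phrase these fields
exp_re = re.compile(r'(\d+)\+?\s*years', re.IGNORECASE)
sal_re = re.compile(r'salary\D*([\d,]+)', re.IGNORECASE)

def extract_fields(text):
    """Return (experience, salary) found in one resume's text, or None where absent."""
    exp = exp_re.search(text)
    sal = sal_re.search(text)
    return (exp.group(1) if exp else None,
            sal.group(1) if sal else None)

# in the real loop you'd pass each page_content; a sample string stands in here
sample = "5 years of experience. Expected salary: 60,000."
print(extract_fields(sample))  # -> ('5', '60,000')
```

    Appending these tuples to a list inside the loop gives you rows you can write out with the `csv` module or load into a pandas DataFrame as your two new columns.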