Tags: python, python-3.x, pandas, pymupdf

Python scraping an unstructured PDF


We get biweekly software releases from a supplier, who provides PDF release notes with each one. The notes contain a lot of irrelevant material, and at the moment we have to manually copy and paste the relevant information from them into a Confluence page.

Ideally I would like to write a Python app that scrapes certain sections out of the PDF. The structure is pretty much as follows (the parts I want to extract are the feature headings in section 2 and the defect table in section 3):

  1. Introduction
  2. New Features
    2.1. New Feature 1
    description
    2.2 New Feature 2
    description
    .
    .
    .
    2.x) New Feature X description
  3. Defect fixes
    description
    table with defect descriptions

The rest of the document is irrelevant in this case.

I have managed to import the file and extract all of the text, but I have no idea how to extract only the headings from section 2 and, from section 3, only the table so that I can reformat it with pandas. Any suggestions on how to go about this?

import fitz  # PyMuPDF

# Raw string, so that "\r" is not interpreted as a carriage return
filename = r'~\releasenotes.pdf'

doc = fitz.open(filename)
print(doc)  # Just to see what comes out

(And now what should I do next?)


Solution

  • A simple regex (regular expression) should do the trick here. I'm making some big assumptions about what the text looks like when it comes out of your PDF read: I have copied the text from your post and called it "doc", as in your question :)

    import re  # regular expression library
    import pandas as pd  # needed for pd.Series below
    
    doc = '''
    Introduction
    New Features
    2.1. New Feature 1
    description
    2.2 New Feature 2
    description
    .
    .
    .
    2.x) New Feature X description
    '''
    
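    ds_features = pd.Series(re.findall(r'2\.[1-9].*\n', doc))
    

    Let me unpack that last line: re.findall returns a list of every piece of text in your document that matches the search pattern. The pattern r'2\.[1-9].*\n' matches a literal "2." (the dot is escaped as \. because an unescaped . matches any character), followed by a digit from 1 to 9, followed by any number of characters (.*) up to and including the line break \n. Note that the placeholder line "2.x)" is not matched, because x is not a digit; a real heading such as 2.3 would be.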
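One caveat with the pattern above is that it is not anchored to the start of a line, so a description that happens to mention something like "version 2.3" would also be picked up from the middle of the sentence. The following is a minimal sketch of a stricter variant, assuming each heading sits on its own line and that the number may be followed by ".", ")" or nothing, as in the outline from the question. The sample text and the helper name `section_2_headings` are made up for illustration.

```python
import re

# Sample text standing in for whatever the PDF extraction returns (an assumption).
sample = """1. Introduction
2. New Features
2.1. New Feature 1
description mentioning version 2.3 in passing
2.2 New Feature 2
description
2.10) New Feature 10 description
3. Defect fixes
"""

# ^          start of a line (because of re.MULTILINE)
# [ \t]*     optional indentation
# (2\.\d+)   a literal "2." followed by one or more digits, captured
# [.)]?      optional trailing "." or ")"
# [ \t]+(.+) the heading text, captured up to the end of the line
HEADING = re.compile(r"^[ \t]*(2\.\d+)[.)]?[ \t]+(.+)$", re.MULTILINE)


def section_2_headings(text):
    """Return (number, title) pairs for every 2.x heading in text."""
    return [(number, title.strip()) for number, title in HEADING.findall(text)]


headings = section_2_headings(sample)
for number, title in headings:
    print(number, title)
```

The list of tuples can be passed straight to `pd.DataFrame(headings, columns=["number", "title"])` if a pandas object is more convenient than a plain list.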

    Hope this fits the bill?
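To go from the opened PDF to the text of one section, one possible approach is to join the text of all pages and then cut out everything between two top-level headings. The sketch below assumes the headings appear on their own lines exactly as in the outline ("2. New Features", "3. Defect fixes") and that no table of contents repeats them verbatim earlier in the file. The helper names `pdf_to_text` and `split_section`, and the sample text including its "4. Known issues" heading and defect row, are hypothetical.

```python
import re


def pdf_to_text(path):
    """Concatenate the plain text of every page (requires PyMuPDF)."""
    import fitz  # imported here so the rest of the sketch runs without PyMuPDF

    doc = fitz.open(path)
    text = "\n".join(page.get_text() for page in doc)
    doc.close()
    return text


def split_section(text, start_heading, end_heading):
    """Return the text after the start heading line and before the end heading line.

    Both headings are regex fragments that must fill a whole line.
    """
    start = re.search("^[ \t]*" + start_heading + "[ \t]*$", text, re.MULTILINE)
    if start is None:
        return ""
    rest = text[start.end():]
    end = re.search("^[ \t]*" + end_heading + "[ \t]*$", rest, re.MULTILINE)
    body = rest[:end.start()] if end else rest
    return body.strip()


# Any line of the form "<number>. <title>", i.e. the next top-level heading.
NEXT_TOP_LEVEL = r"\d+\.[ \t]+\S.*"

# Made-up text standing in for pdf_to_text(filename).
sample = """1. Introduction
welcome text
2. New Features
2.1. New Feature 1
description
3. Defect fixes
description
ID-1 Crash on start
4. Known issues
"""

features = split_section(sample, r"2\.[ \t]+New Features", r"3\.[ \t]+Defect fixes")
defects = split_section(sample, r"3\.[ \t]+Defect fixes", NEXT_TOP_LEVEL)
print(features)
print(defects)
```

How the defect table should be turned into a DataFrame depends on how its rows come out of the text extraction, so that step is left open here. Recent PyMuPDF releases also provide `page.find_tables()`, which may be worth trying before parsing the table text by hand.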