We get bi weekly software releases from a supplier who provides us with PDF release notes. The notes have got a lot of irrelevant stuff in them, but ultimately we need to go and manually copy/paste information from these notes into a Confluence page.
Ideally I would like to be able to write a python app to be able to scrape certain sections out of the PDF. The structure is pretty much as follows (with the bold parts being the ones I want to extract):
rest of the document is irrelevant in this case
I have managed to get it to import the file and extract (all) of the text, but I have really got no idea how to extract only the headings for section 2, and then for section 3 only take the table and reformat it with pandas. Any suggestions on how to go about this ?
import fitz
filename = '~\releasenotes.pdf'
doc = fitz.open(filename)
print (doc) # Just to see what comes out
(and now what should I do next ?)
A simple regex (regular expression) should do the trick here. I'm making some big assumptions around what the text looks like when it comes out of your pdf read - I have copied the text from your post and called it "doc" per your question :)
import re #regular expression library
doc = '''
Introduction
New Features
2.1. New Feature 1
description
2.2 New Feature 2
description
.
.
.
2.x) New Feature X description
'''
ds_features = pd.Series(re.findall('2.[1-9].*\n', doc))
Let me unpack that last line:
re.findall
will produce a list of items in your document that matches the search string
'2.[1-9].*\n'
will find all instances of a 2.
followed by any number from [1-9]
, followed by any number of characters .*
until it reaches a line break \n
.
Hope this fits the bill?