I only want to extract text that has font size 9.800000000000068 and 10.000000000000057 from my pdf files.
The code below returns a list of the font size of each text block and its characters for one pdf file.
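    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer

    Extract_Data=[]
    for page_layout in extract_pages(path):
        print(page_layout)
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                for text_line in element:
                    for character in text_line:
                        if isinstance(character, LTChar):
                            # ends up holding the size of the block's last character
                            Font_size=character.size
                Extract_Data.append([Font_size,(element.get_text())])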
This gives me an Extract_Data list with the various font sizes:
[[9.800000000000068, 'aaa\n'], [11.0, 'dffg\n'], [10.000000000000057, 'bbb\n'], [10.0, 'hs\n'], [8.0, '2\n']]
For example, filtering on font size 10.000000000000057:
    Extract_Data=[]
    for page_layout in extract_pages(path):
        print(page_layout)
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                for text_line in element:
                    for character in text_line:
                        if isinstance(character, LTChar):
                            if character.size == '10.000000000000057':
                                element.get_text()
                Extract_Data.append(element.get_text())
                Data = ''.join(map(str, Extract_Data))
This gives me a Data list with all of the text, shown below. How can I make it extract only the characters with font size '10.000000000000057'?
['aaa\ndffg\nbbb\nhs\n2\n']
I also want to integrate this into a function that does this for multiple files, resulting in a pandas DataFrame that has one row for each pdf.
Desired output: [['aaa\n bbb\n']].
Converting pixels to points (int(character.size) * 72 / 96) as suggested elsewhere did not help. Maybe it is related to this issue: https://github.com/pdfminer/pdfminer.six/issues/202
This is the function it would be integrated into later on:
    directory = 'C:/Users/Sample/'
    resource_manager = PDFResourceManager()
    for file in os.listdir(directory):
        if not file.endswith(".pdf"):
            continue
        fake_file_handle = io.StringIO()
        manager = PDFResourceManager()
        device = PDFPageAggregator(manager, laparams=params)
        interpreter = PDFPageInterpreter(manager, device)
        device = TextConverter(interpreter, fake_file_handle, laparams=LAParams())
        params = LAParams(detect_vertical=True, all_texts=True)
        elements = []
        with open(os.path.join(directory, file), 'rb') as fh:
            parser = PDFParser(fh)
            document = PDFDocument(parser, '')
            if not document.is_extractable:
                raise PDFTextExtractionNotAllowed
            for page in enumerate (PDFPage.create_pages(document)):
                for element in page:
Pdfminer is the wrong tool for that.
Use pdfplumber (https://github.com/jsvine/pdfplumber), which uses pdfminer under the hood, instead. It has utility functions for filtering out objects (e.g. based on font size, as you're trying to do), whereas pdfminer is primarily for getting all text.
    import pdfplumber

    def get_filtered_text(file_to_parse: str) -> str:
        with pdfplumber.open(file_to_parse) as pdf:
            page = pdf.pages[0]  # first page only
            # Keep every non-char object, and only those chars whose font size is 9
            filtered_page = page.filter(lambda obj: not (obj["object_type"] == "char" and obj["size"] != 9))
            return filtered_page.extract_text()

    print(get_filtered_text("./my_pdf.pdf"))
The example above is simpler than your case because it only checks for font size 9.0, whereas you have 9.800000000000068 and 10.000000000000057, so the obj["size"] condition will be more complex in your case.
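One way the more complex condition could look is sketched below. This is not part of pdfplumber: size_matches, make_size_filter and get_text_by_size are hypothetical helper names, and the default of an exact match (tol=0.0) is an assumption based on the question, where text at 10.0 ('hs') is not wanted but text at 10.000000000000057 is. A float copied from Python's printed output round-trips exactly, so an exact comparison against such a value should work for plain floats. The size pdfplumber reports may be rounded differently from what pdfminer printed, so it is worth printing a few obj["size"] values first and adjusting the targets or tol accordingly.

```python
from decimal import Decimal
from typing import Callable, Iterable, Union

Number = Union[int, float, Decimal]


def size_matches(size: Number, targets: Iterable[float], tol: float = 0.0) -> bool:
    """Return True if size is within tol of any target (tol=0.0 means exact)."""
    value = float(size)  # accepts float and Decimal alike
    return any(abs(value - target) <= tol for target in targets)


def make_size_filter(targets: Iterable[float], tol: float = 0.0) -> Callable[[dict], bool]:
    """Build a predicate for page.filter(): keep every non-char object,
    and only those chars whose font size matches one of the targets."""
    wanted = tuple(targets)

    def keep(obj: dict) -> bool:
        if obj.get("object_type") != "char":
            return True
        return size_matches(obj["size"], wanted, tol)

    return keep


def get_text_by_size(file_to_parse: str,
                     targets=(9.800000000000068, 10.000000000000057),
                     tol: float = 0.0) -> str:
    """Hypothetical wrapper: filtered text of all pages, joined by newlines."""
    import pdfplumber  # imported here so the helpers above need no third-party package

    keep = make_size_filter(targets, tol)
    with pdfplumber.open(file_to_parse) as pdf:
        # extract_text() can return None for an empty page, hence the "or ''"
        return "\n".join(page.filter(keep).extract_text() or "" for page in pdf.pages)
```

Passing a larger tol (for example 0.01) would treat 10.0 and 10.000000000000057 as the same size, which may or may not be what you want.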
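obj["size"] has the datatype Decimal (from decimal import Decimal), at least in the pdfplumber version used here, so you will probably have to do something like obj["size"].compare(Decimal("9.800000000000068")) == 0. Pass the target as a string: a Decimal built from a float literal carries the float's binary representation error and will generally not compare equal. The value pdfplumber reports may also be rounded differently from the one pdfminer printed, and other releases may return a plain float, so check a few obj["size"] values before hard-coding the target.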
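For the multi-file part of the question (one DataFrame row per PDF), a minimal sketch follows. The names collect_rows and to_dataframe and the column names "file" and "text" are made up for illustration. The extract argument is assumed to be any function that takes a file path and returns the filtered text as a string.

```python
import os
from typing import Callable, Dict, List


def collect_rows(directory: str, extract: Callable[[str], str]) -> List[Dict[str, str]]:
    """Run extract on every PDF in directory and return one dict per file."""
    rows = []
    for name in sorted(os.listdir(directory)):
        if not name.lower().endswith(".pdf"):
            continue  # skip anything that is not a PDF
        path = os.path.join(directory, name)
        rows.append({"file": name, "text": extract(path)})
    return rows


def to_dataframe(rows: List[Dict[str, str]]):
    """Turn the collected rows into a DataFrame with one row per PDF."""
    import pandas as pd  # imported here so collect_rows works without pandas

    return pd.DataFrame(rows, columns=["file", "text"])
```

Usage would then be something like to_dataframe(collect_rows('C:/Users/Sample/', my_extractor)), where my_extractor stands for whichever text-filtering function you end up with.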