I'm trying to find a way to split text I have already extracted into two variables. I'm working with scientific articles and want to separate the abstract from the rest of the article (e.g. introduction through conclusion), so that I end up with two parts: the abstract and everything else.
How can I do this? I have tried regex but could not get it to work. Below is some of the code I have used.
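import pdfplumber

with pdfplumber.open("") as pdf:
    all_text = ''  # accumulates the text of every page
    for pdf_page in pdf.pages:
        # fall back to '' in case extract_text() yields None for a page without text
        single_page_text = pdf_page.extract_text() or ''
        #print(single_page_text)
        all_text = all_text + '\n' + single_page_text
    #print(all_text)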
I'm assuming that the abstract is enclosed between the strings "Abstract" and "*Correspondence". I use str.split() to split the first page's text at "Abstract", which gives a list containing the text before and after that string. I then split the second element of that list at "*Correspondence", which gives a second list containing the text before and after "*Correspondence". The first element of the second list is the abstract. Everything except the abstract is appended to another variable. Since the abstract is on the first page, this is only applied to the first page, which is selected using enumerate.
import pdfplumber

with pdfplumber.open("s12865-020-00390-9.pdf") as pdf:
    text_without_abstract = ''
    abstract = ''
    for index, pdf_page in enumerate(pdf.pages):
        if index == 0:
            # The abstract is on the first page, so only this page is split
            single_page_text = pdf_page.extract_text()
            # [text before "Abstract", text after "Abstract"]
            split_at_abstract = single_page_text.split("Abstract")
            text_without_abstract += split_at_abstract[0]
            # [abstract, text after "*Correspondence"]
            split_at_asterisk_correspondence = split_at_abstract[1].split("*Correspondence")
            abstract = split_at_asterisk_correspondence[0]
            text_without_abstract += split_at_asterisk_correspondence[1]
        else:
            # All other pages go to the non-abstract text unchanged
            text_without_abstract += pdf_page.extract_text()
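Caution: This approach depends heavily on the string content of the document. It will not work if the string "Abstract" occurs inside the abstract, or if the first string after the abstract is not "*Correspondence". If either string is missing from the first page, indexing the split result with [1] raises an IndexError.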
str.split() : https://docs.python.org/3.8/library/stdtypes.html#str.split
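
To soften the caveat above, the same idea can be sketched with str.partition(), which splits only at the first occurrence of a marker and reports whether the marker was found. This is a minimal sketch, not a tested solution for every journal layout: the function name split_abstract and the sample string are made up for illustration, and the two marker strings are the same assumptions as above. It works on plain text, so you would pass it the first page's extracted text.

```python
def split_abstract(text, start_marker="Abstract", end_marker="*Correspondence"):
    """Return (abstract, rest). The abstract is '' if either marker is missing.

    split_abstract is a hypothetical helper; the default markers are
    assumptions about the document, not something pdfplumber provides.
    """
    # partition() splits at the FIRST occurrence only, so a later
    # "Abstract" inside the abstract body does not break the split.
    before, start_found, after_start = text.partition(start_marker)
    if not start_found:
        return "", text
    abstract, end_found, after_end = after_start.partition(end_marker)
    if not end_found:
        return "", text
    # Like the split() version, both marker strings are dropped.
    return abstract.strip(), before + after_end


# Made-up sample text standing in for the first page of an article
sample = (
    "Title\nAbstract\nBackground: cells. Results: more.\n"
    "*Correspondence: a@b.org\nIntroduction"
)
abstract, rest = split_abstract(sample)
```

If a marker is missing, the function returns an empty abstract and the unchanged text instead of raising an IndexError, so the caller can decide what to do with documents that do not follow the expected layout.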