Tags: python, pdf, nlp, nltk, pypdf

Python: How to solve merged words when extracting text from a PDF?


I'm struggling with word extraction from a set of PDF files. These files are academic papers that I downloaded from the web.

The data is stored on my local device, sorted by name, under this relative path inside the project folder: './papers/data'. You can find my data here.

My code executes from a code folder in the project repo ('./code').

The PDF word-extraction section of the code looks like this:

import PyPDF2 as pdf
from os import listdir

# Open the files:
# I) List the files:
files_in_dir = listdir('../papers/data')
# II) Open each file and save its text to a Python object:
papers_text_list = []
for idx in range(len(files_in_dir)):
    with open(f"../papers/data/{files_in_dir[idx]}", mode="rb") as paper:
        my_pdf = pdf.PdfFileReader(paper)
        # One dynamically named variable per paper: text_0, text_1, ...
        vars()["text_%s" % idx] = ''
        for i in range(my_pdf.numPages):
            page_to_print = my_pdf.getPage(i)
            vars()["text_%s" % idx] += page_to_print.extractText()
        papers_text_list.append(vars()["text_%s" % idx])

The problem is that for some texts I'm getting merged words inside the Python list.

text_1.split()

[ ... ,'examinedthee', 'ectsofdi', 'erentoutdoorenvironmentsinkindergartenchildren', '™sPAlevel,', 'ages3', 'Œ5.The', 'ndingsrevealedthatchildren', '‚sPAlevelhigherin', 'naturalgreenenvironmentsthaninthekindergarten', '™soutdoorenvir-', 'onment,whichindicatesgreenenvironmentso', 'erbetteropportunities', 'forchildrentodoPA.', ...]

Other texts, meanwhile, are extracted correctly.

text_0.split()

['Urban','Forestry', '&', 'Urban', 'Greening', '16', '(2016)','76–83Contents', 'lists', 'available', 'at', 'ScienceDirect', 'Urban', 'Forestry', '&', 'Urban', 'Greening', ...]

At this point, I thought that tokenizing could solve my problem, so I gave the nltk module a chance.

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
doc = tokenizer.tokenize(text_1)
paper_words = [token for token in doc]
# Lower-case every token, keeping any non-string item unchanged:
paper_words_lower = []
for token in paper_words:
    try:
        word = token.lower()
    except AttributeError:
        word = token
    finally:
        paper_words_lower.append(word)

['contentslistsavailableat', 'sciencedirecturbanforestry', 'urbangreening', 'journalhomepage', 'www', 'elsevier', 'com', 'locate', 'ufug', 'urbangreenspacesforchildren', 'across', 'sectionalstudyofassociationswith', 'distance', 'physicalactivity', 'screentime', 'generalhealth', 'andoverweight', 'abdullahakpinar', 'adnanmenderesüniversitesi', 'ziraatfakültesi', 'peyzajmimarl', 'bölümü', '09100ayd', 'õn', 'turkey', ... 'sgeneralhealth', 'onlychildren', 'sagewas', 'signicantlyassociatedwiththeiroverweight', ...]

I even tried the spaCy module... but the problem was still there.
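
Roughly what that spaCy attempt looked like (a minimal sketch; it assumes the en_core_web_sm model is installed and reuses text_1 from the extraction above):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(text_1)
paper_words_lower = [token.text.lower() for token in doc if token.is_alpha]
# The merged strings survive: spaCy splits on whitespace and punctuation,
# so 'examinedthee' still comes through as a single token.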

My conclusion here is that if the problem can be solved, it has to be solved in the PDF word-extraction section. I found this related StackOverflow question, but its solution couldn't fix my problem.

Why is this happening, and how can I solve it?

PS: A paper in the list that serves as an example of the trouble is "AKPINAR_2017_Urban green spaces for children.pdf".

You can use the following code to import it:

import PyPDF2 as pdf

with open("AKPINAR_2017_Urban green spaces for children.pdf", mode="rb") as paper:
    my_pdf = pdf.PdfFileReader(paper)
    text = ''
    for i in range(my_pdf.numPages):
        page_to_print = my_pdf.getPage(i)
        text += page_to_print.extractText()

Solution

  • By far the simplest method is to use a modern PDF viewer/editor that allows cut-and-paste with some additional adjustments. I had no problems reading aloud or extracting most of those academic journals, since they are (bar one) readable text and thus export well as plain text. It took 4 seconds TOTAL to export 24 of those PDF files into readable text (about 6 per second, except #24 of 25), using forfiles /m *.pdf /C "cmd /c pdftotext -simple2 @file @fname.txt". Compare the result with your first non-readable example. (A Python sketch of the same conversion follows below.)

    However, the one exception was Hernadez_2005, because it consists of page images; extracting it needs OCR conversion, with considerable (not trivial) training of the editor to handle scientific terms, foreign hyphenation, and constantly shifting styles. With some work it can, in say WordPad, produce a good-enough result, fit for editing in Microsoft Word, which you could save as plain text for parsing in Python.

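    For completeness, here is a minimal sketch of driving the same conversion from Python, assuming the pdftotext binary used above (an Xpdf build that supports -simple2) is on the PATH and the folder layout matches the question:

    import subprocess
    from pathlib import Path

    data_dir = Path("../papers/data")
    papers_text_list = []
    for pdf_path in sorted(data_dir.glob("*.pdf")):
        txt_path = pdf_path.with_suffix(".txt")
        # pdftotext writes a .txt next to each .pdf; -enc forces UTF-8 output.
        subprocess.run(
            ["pdftotext", "-simple2", "-enc", "UTF-8",
             str(pdf_path), str(txt_path)],
            check=True,
        )
        papers_text_list.append(txt_path.read_text(encoding="utf-8"))

    This keeps the rest of the pipeline (tokenizing, lower-casing) unchanged; only the extraction step is swapped out.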