python pdf text-mining

Is there a function to extract the text under a specific heading from a PDF?


I have multiple paragraphs in my PDF document. Each paragraph has a unique heading. How can I extract the text under a specific heading that I am looking for?


Solution

  • You can use the PyPDF2 Python library to extract the page text, which you can then search for the heading. A sample snippet:

    # import the library (PyPDF2 3.x names; older releases used
    # PdfFileReader, numPages, getPage and extractText instead)
    import PyPDF2
    
    # open the PDF in binary mode; the with-block closes the file automatically
    with open('example.pdf', 'rb') as pdf_file:
        # create a PDF reader object
        reader = PyPDF2.PdfReader(pdf_file)
    
        # print the number of pages in the PDF
        print(len(reader.pages))
    
        # get the first page
        page = reader.pages[0]
    
        # extract and print the text of that page
        print(page.extract_text())
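
The extracted text is a plain string, so headings are not marked in any way. One possible approach, sketched here under the assumption that each heading sits on its own line and that you already know the list of headings in the document, is to collect the lines between the wanted heading and the next known heading. The names `extract_section` and `all_headings` are hypothetical helpers for illustration, not part of PyPDF2, and `sample` stands in for the text a page extraction might return.

```python
def extract_section(text, heading, all_headings):
    """Return the lines between `heading` and the next known heading.

    Assumption: every heading appears alone on its own line in `text`.
    Matching ignores case and surrounding whitespace.
    """
    known = {h.strip().lower() for h in all_headings}
    target = heading.strip().lower()
    collected = []
    inside = False
    for line in text.splitlines():
        key = line.strip().lower()
        if not inside:
            # skip everything until the wanted heading is found
            if key == target:
                inside = True
            continue
        if key in known:
            # any known heading ends the current section
            break
        collected.append(line)
    return "\n".join(collected).strip()


# Stand-in for text extracted from a PDF page (made-up sample data)
sample = """Introduction
This report covers sales.
Methods
We surveyed 100 stores.
Data was collected weekly.
Results
Sales rose."""

headings = ["Introduction", "Methods", "Results"]
print(extract_section(sample, "Methods", headings))
```

If the heading is not found, the function returns an empty string. Real PDFs often break lines differently from how they look on the page, so the heading match may need to be loosened (for example with `startswith` or a regular expression) for a given document.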