Extract metadata from an online PDF using pdfminer in Python

I want to extract metadata from an online PDF using pdfminer: fields such as the title, the author, the number of lines, and so on.

I am trying to adapt a related solution, which uses the following code:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.layout import LAParams
from pdfminer.converter import TextConverter
from pdfminer.pdfpage import PDFPage
import io
import urllib.request
import requests

def pdf_to_text(pdf_file):
    text_memory_file = io.StringIO()

    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, text_memory_file, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Process only the first 3 pages of the PDF file.
    for page in PDFPage.get_pages(pdf_file, pagenos=(0, 1, 2)):
        interpreter.process_page(page)
    text = text_memory_file.getvalue()
    return text

# # online pdf to text by urllib
# online_pdf_file=urllib.request.urlopen('')
# pdf_memory_file=io.BytesIO()
# pdf_memory_file.write(online_pdf_file.read())
# print(pdf_to_text(pdf_memory_file))

# online pdf to text by requests
response = requests.get('')
pdf_memory_file = io.BytesIO()
pdf_memory_file.write(response.content)
print(pdf_to_text(pdf_memory_file))

However, I cannot work out what to change in this code to extract the metadata.


  • You may find pdfplumber of interest - it's built on top of pdfminer.six and simplifies a lot of tasks.

    import io
    import pdfplumber
    import requests
    url = ""
    content = io.BytesIO(requests.get(url).content)
    pdf = pdfplumber.open(content)
    >>> pdf.metadata
    {'Title': 'UnderstandingGIL',
     'Author': 'David Beazley',
     'Subject': '',
     'Producer': 'Mac OS X 10.6.2 Quartz PDFContext',
     'Creator': 'Keynote',
     'CreationDate': "D:20100220124003Z00'00'",
     'ModDate': "D:20100220124003Z00'00'",
     'Keywords': '',
     'AAPL:Keywords': ['']}
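
If you would rather stay with pdfminer.six itself, the same info dictionary can be read through PDFParser and PDFDocument. The following is a minimal sketch rather than a recipe for every PDF: the helper names decode_info_value and read_metadata are made up for illustration, and decoding non-UTF-16 strings as Latin-1 is a simplifying assumption (PDFDocEncoding differs from Latin-1 for some characters).

```python
def decode_info_value(value):
    """Turn a raw info-dictionary value into a readable string."""
    if isinstance(value, bytes):
        # PDF text strings are either UTF-16BE with a byte order mark or
        # PDFDocEncoding; Latin-1 is an approximation of the latter (assumption).
        if value.startswith(b"\xfe\xff"):
            return value[2:].decode("utf-16-be", errors="replace")
        return value.decode("latin-1")
    # Anything else (numbers, names, lists) is simply stringified.
    return str(value)


def read_metadata(pdf_file):
    """Return (info dict, page count) for a seekable binary PDF stream."""
    # Imported lazily so decode_info_value works without pdfminer.six installed.
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdftypes import resolve1

    document = PDFDocument(PDFParser(pdf_file))

    info = {}
    # document.info is a list of dictionaries, usually with a single entry.
    for entry in document.info:
        for key, value in entry.items():
            # resolve1 follows indirect object references, if there are any.
            info[key] = decode_info_value(resolve1(value))

    # Count pages by walking the page tree once.
    page_count = sum(1 for _ in PDFPage.create_pages(document))
    return info, page_count
```

Called as read_metadata(io.BytesIO(requests.get(url).content)), this should return a dictionary similar to the one above together with the page count. The number of lines is not stored in the metadata, so it would have to be counted from the extracted text instead.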