Tags: python, python-3.x, web-scraping, pypdf

Can't fetch only the names from a table located in a pdf file from a webpage


I've created a script in Python using the requests module and the PyPDF2 library to parse the content of a PDF from a website. I'm only interested in the names in column A, under Facility Name, on page 4 of that PDF (the tabular content). My script can scrape the content of that page, but I can't find any way to get only the names and nothing else.

The link to the PDF file is the URL used in the script below.

This is what the table looks like

I'm only interested in the names under the column header Facility Name.

I've tried with:

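import io

import PyPDF2
import requests

URL = 'https://www.cms.gov/Medicare/Provider-Enrollment-and-Certification/CertificationandComplianc/Downloads/SFFList.pdf'

# Download the PDF into memory
res = requests.get(URL)
f = io.BytesIO(res.content)

# PdfFileReader/getPage/extractText are the PyPDF2 1.x/2.x names;
# PyPDF2 3.x renamed them to PdfReader, reader.pages[...] and extract_text()
reader = PyPDF2.PdfFileReader(f)
contents = reader.getPage(3).extractText()  # page 4 (pages are zero-indexed)
print(contents)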

The output I'm getting right now looks like:

Facilit
y Name
Address
City
State
Zip
Phone 
Number
Months as an 
SFFWillows Center
320 North Crawford Street
Willows
CA95988530-934-2834
5Winter Park Care & Rehabilitation Center
2970 Scarlett Rd
Winter Park
FL32792407-671-8030
and so on -----

The output I'd like to have:

Willows Center
Winter Park Care & Rehabilitation Center
Pinehill Nursing Center
River Brook Healthcare Center

How can I get only the names available in a table from a pdf file?


Solution

  • Unfortunately for you, a PDF is not a structured document. It's just strings and images placed at coordinates, so that the page looks exactly as it was created regardless of which program renders it. This means you cannot parse it as easily as HTML: a table is not wrapped in a <table> element, its cells are just pieces of text scattered across the page.

    Take a look at https://github.com/atlanhq/camelot; it might help you.

    (There are at most 10 pages with a table in that file, so copying the names out manually might be the faster option here, unless you have many PDFs like this.)
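As a rough illustration of the Camelot suggestion, here is a minimal sketch rather than a tested solution for this particular file. It assumes the table on page 4 is drawn with ruling lines (so Camelot's "lattice" flavor applies; "stream" would be the one to try otherwise), that Facility Name is the first column Camelot detects, and that camelot-py and requests are installed. The helper names `clean_cell`, `first_column_names` and `facility_names` are made up for this example.

```python
import os
import re
import tempfile

URL = 'https://www.cms.gov/Medicare/Provider-Enrollment-and-Certification/CertificationandComplianc/Downloads/SFFList.pdf'


def clean_cell(text):
    """Collapse line breaks and repeated whitespace inside one table cell."""
    return re.sub(r"\s+", " ", str(text)).strip()


def first_column_names(rows, header="Facility Name"):
    """Return the cleaned first cell of every row, skipping blanks and the header."""
    names = []
    for row in rows:
        if not row:
            continue
        name = clean_cell(row[0])
        if name and name.lower() != header.lower():
            names.append(name)
    return names


def facility_names(url=URL, page="4"):
    """Download the PDF and read the first column of the tables on one page.

    Assumption: the table has ruling lines, so flavor="lattice" can find it.
    """
    # Third-party packages, imported here so the helpers above work without them
    import camelot
    import requests

    res = requests.get(url, timeout=30)
    res.raise_for_status()

    with tempfile.TemporaryDirectory() as tmpdir:
        # Camelot reads from a file path, so save the download first
        path = os.path.join(tmpdir, "sff.pdf")
        with open(path, "wb") as fh:
            fh.write(res.content)
        tables = camelot.read_pdf(path, pages=page, flavor="lattice")
        # Each detected table exposes its cells as a pandas DataFrame in .df
        rows = [row for table in tables for row in table.df.values.tolist()]

    return first_column_names(rows)
```

Calling `facility_names()` and printing each item should give one name per line if the assumptions hold. If Camelot splits the table differently, or the header cell comes out with different text, the `header` argument and the column index in `first_column_names` are the places to adjust.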