I am stuck on how to deal with PDFs here. I don't know how to scrape the text directly from the web, and when I download the files locally their contents look like complete nonsense rather than the actual text data.
I have tried downloading with requests, but the contents are then just useless.
import PyPDF2
# textract
import requests
# from nltk.tokenize import word_tokenize
# from nltk.corpus import stopwords

def get_amount(url):
    # Download the PDF and write the raw response bytes to a local file
    data = requests.get(url)
    with open('/Users/derricdonehoo/code/derric-d/price-processor/exmpl.pdf', 'wb') as f:
        f.write(data.content)
I am trying to figure out how to get data from a PDF. Any suggestions would be greatly appreciated!
Please modify your code as shown below:
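import PyPDF2

# Open in binary mode ('rb'); a PDF is not a plain-text file
pdf_file = open('/Users/derricdonehoo/code/derric-d/price-processor/exmpl.pdf', 'rb')
# Legacy PyPDF2 names; PyPDF2 3.x uses PdfReader, len(reader.pages), reader.pages[i] and extract_text()
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
# Loop over every page index instead of always reading page 0
for i in range(number_of_pages):
    page = read_pdf.getPage(i)
    page_content = page.extractText()
    print(page_content)
pdf_file.close()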
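If the downloaded file still looks like nonsense, it may be worth checking that the server really returned a PDF and not, say, an HTML error or login page. The sketch below is only one way to do that, assuming a well-formed PDF begins with the `%PDF-` header marker. The names `looks_like_pdf` and `as_pdf_stream` are hypothetical helpers made up for this example; they are not part of PyPDF2 or requests.

```python
import io

# Assumption: a well-formed PDF starts with this header marker.
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes begin with the PDF header marker."""
    # Tolerate leading whitespace before the header.
    return data.lstrip().startswith(PDF_MAGIC)


def as_pdf_stream(data: bytes) -> io.BytesIO:
    """Wrap downloaded bytes in an in-memory, file-like object.

    Raises ValueError when the bytes do not look like a PDF, which
    usually means the download returned something else (e.g. HTML).
    """
    if not looks_like_pdf(data):
        raise ValueError("not a PDF, first bytes: %r" % data[:20])
    return io.BytesIO(data)
```

With this in place, something like `stream = as_pdf_stream(requests.get(url).content)` should give a file-like object that can be passed to `PyPDF2.PdfFileReader(stream)` without writing to disk first. Note that a real PDF will also look like gibberish in a text editor, since it is a binary format, so that alone does not mean the download failed.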