Error while reading PDF on SharePoint page using Python 3.7

I'm using "PyPDF2", "urllib.request", "io" and "requests_ntlm" (with "HttpNtlmAuth") to access a company SharePoint 2010 page, then read a PDF and print its text as a string.

I have already managed to read and print a PDF from both a local file and an online PDF, but the same approach does not seem to work with SharePoint 2010.

This is my code:

import urllib.request
import requests
import io
import PyPDF2 as p2
from requests_ntlm import HttpNtlmAuth

url = "http://company.net/sites/folder1/folder2/folder3/folder4/sample.pdf"

response = requests.get(url, auth=HttpNtlmAuth("USER@MAIL.COM","PASSWORD"))

print(response.status_code)
print(response.content)

open = urllib.request.urlopen(response.content).read()

PDFfile = io.BytesIO(open)

pdfread = p2.PdfFileReader(PDFfile)
NrPages = pdfread.getNumPages()

for i in range(NrPages):
    x = pdfread.getPage(i)
    y = str(x.extractText())
    print(y)

Regarding "status_code", I get HTTP 200, so I assume that access has been granted. When printing "response.content" I get something like this:

b'%PDF-1.3\r%\xff\xff\xff\xff\r1 0 obj\r<<\r/Title (\xfe\xff\x00D\x00S\x00-\x00T\x00B\x00L\x00-\x00M\x00A\x00X\x00-\x00D\x00P\x00L\x00S\x00-\x00S\x00C\x00-\x001\x004\x000\x000\x000\x00-\x001\x001\x004\x000\x000\x00-\x001\x001\x000\x00 \x00k\x00s\x00i\x00-\x00M\x00i\x00n\x00.\x00 \x00W\x00T\x00 \x009\x000\x00%\x00-\x00\\(\x00U\x00S\x00C\x00 \x00U\x00n\x00i\x00t\x00s\x00\\)\x00.\x00x\x00l\x00s\x00m)\r/Producer (Amyuni PDF Converter version 4.5.2.7)\r/CreationDate (D:20140624164324-03\'00\')\r>>\rendobj\r7 0 obj\n<< /Length 8 0 R /Filter /FlateDecode >>\nstream\nx\x9c\xb5Z[s\x9bH\x16~O\x95\xffC\xbfL\x8d\xb3\x15a\xfaB\xd3\xbc\xad/J\xc6\x93\xf82\x96\xbc\xaeT\xf9\x85H\xc8f#!\x05\xe1M\xfc\xa7\xb6\xf6\'\xee9M\x03\x8d\xa0\x91\xb23[S\xe3H\xa2\xf9\xf8\xce\xfd\x9cn\xbe\x1dQ\xe2\xc3\x7f\x94P\xe9E\x0c>\xcdVG\xd9\x11\xfe\xe4{\xd2g$\x7f:\x12\xa1\'\x05\tT\xe0\x85\x8c\x88Px\x94\x91\x11\xe5\x9eT$O\x8e\x16\x7f\xc3\xf5\x9e\x88$\xa9\xfe*\x9fZ\xf7\x85\xcc\xa3\xd2q\x1f%\xf5\xb2@y

And so on, which I think is expected.

But when running the rest of the code (urllib.request.urlopen + PDF read) I get the following:

AttributeError: 'bytes' object has no attribute 'timeout'

What do you think is missing here? Perhaps some decoding is missing when reading the file. I would really appreciate some help.

Thank you!

Solution

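urllib.request.urlopen only accepts a URL string or a Request object; you cannot pass bytes to it. Anything that is not a string is treated as a Request object, and urlopen then tries to set a "timeout" attribute on it, which is where the AttributeError comes from. The bytes in response.content are already the PDF file itself, so there is nothing left to download: wrap them in io.BytesIO and pass that to PyPDF2.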
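
To see where the error comes from without touching the network, here is a minimal sketch. The sample bytes are a made-up stand-in for response.content (they are not a complete PDF), and the variable names are only illustrative.

```python
import io
import urllib.request

# Hypothetical stand-in for response.content (not a complete PDF).
pdf_bytes = b"%PDF-1.3\r%\xff\xff\xff\xff\r1 0 obj\r"

# urlopen treats any non-str argument as a Request object and tries to
# set a timeout attribute on it, which a bytes object does not allow.
try:
    urllib.request.urlopen(pdf_bytes)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)
print(error_message)

# The bytes are already the file contents, so wrap them directly
# in a file-like object instead of "opening" them again.
pdf_file = io.BytesIO(pdf_bytes)
print(pdf_file.read(8))
```

The failure happens before any request is sent, so it has nothing to do with SharePoint or NTLM authentication.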

You could try the code below:

import io

import requests
import PyPDF2 as p2
from requests_ntlm import HttpNtlmAuth

url = "http://sp10/Shared%20Documents/test.pdf"

# Raw string so the backslash in DOMAIN\user is not read as an escape sequence.
response = requests.get(url, auth=HttpNtlmAuth(r"CONTOSO\Administrator", "password"))

print(response.status_code)

# response.content already holds the PDF bytes; wrap them in a file-like object.
PDFfile = io.BytesIO(response.content)

pdfread = p2.PdfFileReader(PDFfile)

NrPages = pdfread.getNumPages()

# Print the extracted text of every page.
for i in range(NrPages):
    x = pdfread.getPage(i)
    y = str(x.extractText())
    print(y)