pdf sharepoint io urllib python-3.7

Error while reading PDF on SharePoint page using Python 3.7


I'm using "PyPDF2", "urllib.request", "io" and "requests_ntlm" ("HttpNtlmAuth") to access a company SharePoint 2010 page, then read a PDF and print its text as a string.

I have already managed to read and print a PDF from both a local file and an online PDF, but it does not work with SharePoint 2010.

This is my code:

import urllib.request
import requests
import io
import PyPDF2 as p2
from requests_ntlm import HttpNtlmAuth

url = "http://company.net/sites/folder1/folder2/folder3/folder4/sample.pdf"

response = requests.get(url, auth=HttpNtlmAuth("USER@MAIL.COM","PASSWORD"))

print(response.status_code)
print(response.content)

open = urllib.request.urlopen(response.content).read()

PDFfile = io.BytesIO(open)

pdfread = p2.PdfFileReader(PDFfile)
NrPages = pdfread.getNumPages()

for i in range(NrPages):
    x = pdfread.getPage(i)
    y = str(x.extractText())
    print(y)

Regarding "status_code", I got the HTTP 200 code, so I assume that access has been accepted. When printing the "response.content" I get something like this:

b'%PDF-1.3\r%\xff\xff\xff\xff\r1 0 obj\r<<\r/Title (\xfe\xff\x00D\x00S\x00-\x00T\x00B\x00L\x00-\x00M\x00A\x00X\x00-\x00D\x00P\x00L\x00S\x00-\x00S\x00C\x00-\x001\x004\x000\x000\x000\x00-\x001\x001\x004\x000\x000\x00-\x001\x001\x000\x00 \x00k\x00s\x00i\x00-\x00M\x00i\x00n\x00.\x00 \x00W\x00T\x00 \x009\x000\x00%\x00-\x00\\(\x00U\x00S\x00C\x00 \x00U\x00n\x00i\x00t\x00s\x00\\)\x00.\x00x\x00l\x00s\x00m)\r/Producer (Amyuni PDF Converter version 4.5.2.7)\r/CreationDate (D:20140624164324-03\'00\')\r>>\rendobj\r7 0 obj\n<< /Length 8 0 R /Filter /FlateDecode >>\nstream\nx\x9c\xb5Z[s\x9bH\x16~O\x95\xffC\xbfL\x8d\xb3\x15a\xfaB\xd3\xbc\xad/J\xc6\x93\xf82\x96\xbc\xaeT\xf9\x85H\xc8f#!\x05\xe1M\xfc\xa7\xb6\xf6\'\xee9M\x03\x8d\xa0\x91\xb23[S\xe3H\xa2\xf9\xf8\xce\xfd\x9cn\xbe\x1dQ\xe2\xc3\x7f\x94P\xe9E\x0c>\xcdVG\xd9\x11\xfe\xe4{\xd2g$\x7f:\x12\xa1\'\x05\tT\xe0\x85\x8c\x88Px\x94\x91\x11\xe5\x9eT$O\x8e\x16\x7f\xc3\xf5\x9e\x88$\xa9\xfe*\x9fZ\xf7\x85\xcc\xa3\xd2q\x1f%\xf5\xb2@y

And so on, which I think is expected.

But when running the rest of the code (urllib.request.urlopen + PDF read) I get the following:

AttributeError: 'bytes' object has no attribute 'timeout'

What do you think is missing here? Perhaps some decoding step is needed when reading the file. I would really appreciate some help.

Thank you!


Solution

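  • urllib.request.urlopen accepts only a URL string or a Request object, so bytes cannot be passed to it. response.content already holds the downloaded PDF as bytes; passing it to urlopen raises the AttributeError. Wrap the bytes in io.BytesIO directly instead.

    Try the code below: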

    import io
    
    import requests
    import PyPDF2 as p2
    from requests_ntlm import HttpNtlmAuth
    
    url = "http://sp10/Shared%20Documents/test.pdf"
    
    # Escape the backslash in DOMAIN\user so "\A" is not read as an escape sequence.
    response = requests.get(url, auth=HttpNtlmAuth("CONTOSO\\Administrator", "password"))
    
    print(response.status_code)
    
    # response.content already holds the PDF bytes; wrap them in a file-like object.
    PDFfile = io.BytesIO(response.content)
    
    pdfread = p2.PdfFileReader(PDFfile)
    
    NrPages = pdfread.getNumPages()
    
    # Extract and print the text of each page.
    for i in range(NrPages):
        x = pdfread.getPage(i)
        y = str(x.extractText())
        print(y)
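
An HTTP 200 from SharePoint does not by itself guarantee that the body is a PDF (a login or error page may also come back with 200), so it can help to check the bytes before handing them to the PDF reader. The following is a minimal sketch using only the standard library. The helper names `looks_like_pdf` and `to_pdf_stream` are hypothetical, and the sample bytes are the start of the `response.content` output shown in the question.

```python
import io

PDF_MAGIC = b"%PDF-"  # every PDF file starts with this header


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the PDF header."""
    return data.startswith(PDF_MAGIC)


def to_pdf_stream(data: bytes) -> io.BytesIO:
    """Wrap already-downloaded PDF bytes in a seekable file-like object."""
    if not looks_like_pdf(data):
        raise ValueError("response body does not look like a PDF")
    return io.BytesIO(data)


# Stand-in for response.content, copied from the start of the question's output.
sample = b"%PDF-1.3\r%\xff\xff\xff\xff\r1 0 obj\r"
stream = to_pdf_stream(sample)
print(stream.read(8))  # b'%PDF-1.3'
```

In the code above, `to_pdf_stream(response.content)` could take the place of `io.BytesIO(response.content)`; the returned object is what gets passed to `p2.PdfFileReader`.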
    
