python amazon-web-services flask amazon-s3 python-docx

Reading a docx file from an S3 bucket with Flask results in an AttributeError


I have run into many different errors and am not sure which ones are relevant. The problem is not the credentials: I can already upload files and read a .txt file from the bucket. Now I want to read a .docx file.

I created a form in index.html with a text area for the exact file name and a submit button. Submitting it should open a new page showing the number of paragraphs in the .docx file stored in my AWS S3 bucket.

The error I'm getting is this:

AttributeError: 'StreamingBody' object has no attribute 'seek'

My code looks like this:

path = "s3://***bucket/"
bucket_name = "***bucket"

@app.route('/resultfiles', methods=["POST"])
def getdata():
    thefilename = request.form['file_name']

    if '.docx' in thefilename:
        
        object_key = thefilename
        file_object = client.get_object(Bucket=bucket_name, Key=object_key)
        body = file_object['Body']
        
        doc = docx.Document(body)
        docx_paras = len(doc.paragraphs)
    
    return render_template('resultfiles.html', docx_paras=docx_paras)


Solution

  • I checked the python-docx documentation, specifically the Document constructor:

    docx.Document(docx=None)

    Return a Document object loaded from docx, where docx can be either a path to a .docx file (a string) or a file-like object. If docx is missing or None, the built-in default document “template” is loaded.

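    So the constructor expects either a path to a file or a file-like object. The StreamingBody that boto3 returns can be read, but it has no seek method, and python-docx needs a seekable object because a .docx file is a ZIP archive. Reading the body into bytes and wrapping them in io.BytesIO gives a seekable file-like object. Here's some sample code: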

    import io
    
    import boto3
    import docx
    
    BUCKET_NAME = "my-bucket"
    
    def main():
        s3 = boto3.resource("s3")
        bucket = s3.Bucket(BUCKET_NAME)
    
        object_in_s3 = bucket.Object("test.docx")
        object_as_streaming_body = object_in_s3.get()["Body"]
        print(f"Type of object_as_streaming_body: {type(object_as_streaming_body)}")
        object_as_bytes = object_as_streaming_body.read()
        print(f"Type of object_as_bytes: {type(object_as_bytes)}")
    
        # Now we use BytesIO to create a file-like object from our byte-stream
        object_as_file_like = io.BytesIO(object_as_bytes)
        
        # Et voila!
        document = docx.Document(docx=object_as_file_like)
    
        print(document.paragraphs)
    
    if __name__ == "__main__":
        main()
    

    Running the script prints:

    $ python test.py
    Type of object_as_streaming_body: <class 'botocore.response.StreamingBody'>
    Type of object_as_bytes: <class 'bytes'>
    [<docx.text.paragraph.Paragraph object at 0x00000258B7C34A30>]
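
To see why the BytesIO wrapper matters without touching S3, here is a minimal standard-library sketch. It relies on the fact that a .docx file is a ZIP container, so it uses zipfile directly instead of python-docx. NonSeekableStream, as_seekable, build_sample_archive and try_open are hypothetical names made up for this illustration; NonSeekableStream only imitates the relevant property of StreamingBody (it has read() but no seek()) and is not the real botocore class.

```python
import io
import zipfile


class NonSeekableStream:
    """Hypothetical stand-in for a streaming response body: read() only, no seek()."""

    def __init__(self, data: bytes):
        self._buffer = io.BytesIO(data)

    def read(self, amt=None):
        return self._buffer.read() if amt is None else self._buffer.read(amt)


def as_seekable(stream) -> io.BytesIO:
    """Read the whole stream into memory and return a seekable file-like object."""
    return io.BytesIO(stream.read())


def build_sample_archive() -> bytes:
    """Create a tiny ZIP archive in memory (a .docx file is a ZIP container)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", "<document/>")
    return buffer.getvalue()


def try_open(file_like) -> str:
    """Return the archive member names on success, or the error message on failure."""
    try:
        with zipfile.ZipFile(file_like) as archive:
            return ", ".join(archive.namelist())
    except AttributeError as error:
        return f"AttributeError: {error}"


data = build_sample_archive()

# Passing the read-only stream directly fails because zipfile needs seek().
direct_result = try_open(NonSeekableStream(data))

# Wrapping the bytes in BytesIO first gives zipfile the seekable object it needs.
wrapped_result = try_open(as_seekable(NonSeekableStream(data)))

print(direct_result)
print(wrapped_result)
```

The first call should report a missing seek attribute, much like the error in the question, while the second lists the archive member. Applied to the Flask route from the question, the equivalent change would be to replace docx.Document(body) with docx.Document(io.BytesIO(body.read())), after adding import io. Note that this reads the whole object into memory, which is fine for typical documents but may not suit very large files.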