Tags: python, python-2.7, web-scraping, beautifulsoup, python-docx

How to add pictures to a docx from a URL with Python?


I am having trouble with the python-docx library. I have scraped images from a website and want to add them to a docx file, but I cannot add them directly from their URLs. I keep getting this error:

  File "C:\Python27\lib\site-packages\docx\image\image.py", line 46, in from_file
    with open(path, 'rb') as f:
IOError: [Errno 22] invalid mode ('rb') or filename: 'http://upsats.com/Content/Product/img/Product/Thumb/PCB2x8-.jpg'

This is my code:

    import urllib
    import requests
    from bs4 import BeautifulSoup
    from docx import Document
    from docx.shared import Inches
    import os


    document = Document()

    document.add_heading("Megatronics Items Full Search", 0)


    FullPage = ['New-Arrivals-2017-6', 'Big-Sales-click-here', 'Arduino-Development-boards',
                'Robotics-and-Copters', 'Breakout-Boards', 'RC-Wireless-communication', 'GSM,-GPS,-RFID,-Wifi',
                'Advance-Development-boards-and-starter-Kits', 'Sensors-and-IMU', 'Solenoid-valves,-Relays,--Switches',
                'Motors,-drivers,-wheels', 'Microcontrollers-and-Educational-items', 'Arduino-Shields',
                'Connectivity-Interfaces', 'Power-supplies,-Batteries-and-Chargers', 'Programmers-and-debuggers',
                'LCD,-LED,-Cameras', 'Discrete-components-IC', 'Science-Education-and-DIY', 'Consumer-Electronics-and-tools',
                'Mechanical-parts', '3D-Printing-and-CNC-machines', 'ATS', 'UPS', 'Internal-Battries-UPS',
                'External-Battries-UPS']

    urlp1 = "http://www.arduinopak.com/Prd.aspx?Cat_Name="
    URL = urlp1 + FullPage[0]

    for n in FullPage:
        URL = urlp1 + n
        page = urllib.urlopen(URL)
        bsObj = BeautifulSoup(page, "lxml")
        panel = bsObj.findAll("div", {"class": "panel"})

        for div in panel:
            titleList = div.find('div', attrs={'class': 'panel-heading'})
            imageList = div.find('div', attrs={'class': 'pro-image'})
            descList = div.find('div', attrs={'class': 'pro-desc'})

            r = requests.get("http://upsats.com/", stream=True)
            data = r.text

            for link in imageList.find_all('img'):
                image = link.get("src")
                image_name = os.path.split(image)[1]
                r2 = requests.get(image)
                with open(image_name, "wb") as f:
                    f.write(r2.content)

                print(titleList.get_text(separator=u' '))
                print(imageList.get_text(separator=u''))
                print(descList.get_text(separator=u' '))
                document.add_heading("%s \n" % titleList.get_text(separator=u' '))
                document.add_picture(image, width=Inches(1.5))
                document.add_paragraph("%s \n" % descList.get_text(separator=u' '))

    document.save('megapy.docx')

This is not all of it, just the main part. I am having problems getting the pictures that I downloaded into the docx, and I do not know how to add them. Do I have to convert them first? I think I have to format them somehow, but how do I do that?

All I know is the problem lies within this code:

document.add_picture(image, width=Inches(1.0))

How do I make this image show up in docx from the URL? What am I missing?


Solution

  • Update

    I tested with 10 images and got a working docx. When loading many images, I hit an error at one point and worked around it by wrapping the call in try/except (see below). The resulting megapy.docx was 165 MB and took about 10 minutes to create.

    I changed:

    with open(image_name, "wb") as f:
        f.write(r2.content)
    

    To:

    image = io.BytesIO(r2.content)  # needs `import io` at the top of the script
    

    And added:

    try:
        document.add_picture(image, width=Inches(1.5))
    except Exception:
        # Skip an image that fails to load and carry on with the rest.
        pass
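
Taken together, the two changes above could be wrapped in one small helper, roughly as in the sketch below. The name `add_picture_from_url` and the `fetch` parameter are hypothetical and only introduced here for illustration; the default download assumes `requests` is installed, and the 30-second timeout is an arbitrary choice.

```python
import io


def add_picture_from_url(document, url, width=None, fetch=None):
    """Download `url` and add it to `document` as a picture.

    Returns True if the picture was added and False if it was skipped.
    `fetch` is a hypothetical hook: any callable that takes a URL and
    returns the image bytes. By default it downloads with requests.
    """
    if fetch is None:
        import requests  # imported lazily so the helper can be tried offline

        def fetch(u):
            response = requests.get(u, timeout=30)
            response.raise_for_status()  # treat HTTP errors as failures
            return response.content

    try:
        data = fetch(url)
        # add_picture accepts a file-like object, so no file on disk is needed.
        document.add_picture(io.BytesIO(data), width=width)
    except Exception as exc:  # unlike a bare except, this lets Ctrl+C through
        print("Skipping %s: %s" % (url, exc))
        return False
    return True
```

In the loop from the question, the call would then be something like `add_picture_from_url(document, link.get("src"), width=Inches(1.5))`. Printing the skipped URL makes it easier to see which image caused the error instead of silently dropping it.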
    



    Use the io library to create file-like objects.

    An example that works on both Python 2 and Python 3:

    import requests
    import io
    from docx import Document
    from docx.shared import Inches
    
    url = 'https://upload.wikimedia.org/wikipedia/commons/thumb/f/f3/Usain_Bolt_Rio_100m_final_2016k.jpg/200px-Usain_Bolt_Rio_100m_final_2016k.jpg'
    response = requests.get(url, stream=True)
    image = io.BytesIO(response.content)
    
    document = Document()
    document.add_picture(image, width=Inches(1.25))
    document.save('demo.docx')
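
As for why the URL string fails while the stream works, the standard-library-only sketch below may help. It assumes, going by the traceback in the question, that python-docx passes a string argument straight to `open()`. The placeholder bytes stand in for `response.content` and are not real image data.

```python
import io

url = "http://upsats.com/Content/Product/img/Product/Thumb/PCB2x8-.jpg"

# A URL string is treated like a local file path, and no such file exists.
try:
    open(url, "rb")
    opened = True
except (IOError, OSError):
    opened = False

# Bytes already held in memory can be wrapped in a file-like object instead.
stream = io.BytesIO(b"placeholder bytes standing in for response.content")
first_bytes = stream.read(11)
stream.seek(0)  # rewind so a later reader starts from the beginning
```

Here `opened` ends up `False`, which mirrors the IOError from the question, while `stream` offers the `read()` and `seek()` methods that a file opened in binary mode would.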
    
