Tags: python, beautifulsoup, lxml, urllib2, urllib

Multithreading for faster downloading


How can I download multiple links simultaneously? My script below works, but it downloads only one file at a time, which is extremely slow. I can't figure out how to incorporate multithreading into it.

The Python script:

from BeautifulSoup import BeautifulSoup
import lxml.html as html
import urlparse
import os, sys
import urllib2
import re

print ("downloading and parsing Bibles...")
root = html.parse(open('links.html'))
for link in root.findall('//a'):
  url = link.get('href')
  name = urlparse.urlparse(url).path.split('/')[-1]
  dirname = urlparse.urlparse(url).path.split('.')[-1]
  f = urllib2.urlopen(url)
  s = f.read()
  if (os.path.isdir(dirname) == 0): 
    os.mkdir(dirname)
  soup = BeautifulSoup(s)
  articleTag = soup.html.body.article
  converted = str(articleTag)
  full_path = os.path.join(dirname, name)
  open(full_path, 'w').write(converted)
  print(name)

The HTML file called links.html:

<a href="http://www.youversion.com/bible/gen.1.nmv-fas">http://www.youversion.com/bible/gen.1.nmv-fas</a>

<a href="http://www.youversion.com/bible/gen.2.nmv-fas">http://www.youversion.com/bible/gen.2.nmv-fas</a>

<a href="http://www.youversion.com/bible/gen.3.nmv-fas">http://www.youversion.com/bible/gen.3.nmv-fas</a>

<a href="http://www.youversion.com/bible/gen.4.nmv-fas">http://www.youversion.com/bible/gen.4.nmv-fas</a>

Solution

  • It looks to me like the classic producer-consumer problem - see Wikipedia.

    You could use something like this:

    import Queue
    import threading

    queue = Queue.Queue()

    print("downloading and parsing Bibles...")
    root = html.parse(open('links.html'))
    for link in root.findall('//a'):
      url = link.get('href')
      queue.put(url)  # produce

    def worker():
      while True:
        url = queue.get()  # consume
        name = urlparse.urlparse(url).path.split('/')[-1]
        dirname = urlparse.urlparse(url).path.split('.')[-1]
        f = urllib2.urlopen(url)
        s = f.read()
        if not os.path.isdir(dirname):
          os.mkdir(dirname)  # note: two workers may race here on the same dirname
        soup = BeautifulSoup(s)
        articleTag = soup.html.body.article
        converted = str(articleTag)
        full_path = os.path.join(dirname, name)
        open(full_path, 'w').write(converted)
        print(name)
        queue.task_done()

    for _ in range(4):  # start 4 worker threads
      t = threading.Thread(target=worker)
      t.daemon = True   # don't block interpreter exit
      t.start()

    queue.join()  # wait until every queued url has been processed
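    As an aside, if you can move to Python 3, `concurrent.futures.ThreadPoolExecutor` hides the queue and thread bookkeeping entirely. A minimal sketch - `fetch` below is a hypothetical stand-in for the real download-and-parse step, so the example runs without network access:

    ```python
    from concurrent.futures import ThreadPoolExecutor

    def fetch(url):
        # stand-in for the real download (urlopen + BeautifulSoup parsing)
        return "content of %s" % url

    urls = [
        "http://www.youversion.com/bible/gen.1.nmv-fas",
        "http://www.youversion.com/bible/gen.2.nmv-fas",
    ]

    # map() runs fetch() across a pool of worker threads,
    # returning results in the same order as the input urls
    with ThreadPoolExecutor(max_workers=4) as pool:
        pages = list(pool.map(fetch, urls))
    ```

    Because `pool.map` preserves input order, you can zip the results back up with the urls they came from.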