
Python - Twisted : POST in a form


Hi Guys !

I'm still discovering Twisted, and I wrote this script to parse the contents of HTML tables into Excel. It works well. My question is: how can I do the same for a single web page (http://bandscore.ielts.org/), sending many POST requests to fetch all the results, parsing them with BeautifulSoup, and then writing them to Excel?

Parsing the source and writing it to Excel is fine, but I don't know how to make a POST request with Twisted in order to implement that.

This is the script I use to parse many different pages with Twisted. I want to write the same kind of script, but sending many different sets of POST data to a single page instead of fetching many pages:

from twisted.web import client
from twisted.internet import reactor, defer
from bs4 import BeautifulSoup
import time
import xlwt

start = time.time()
wb = xlwt.Workbook(encoding='utf-8')
ws = wb.add_sheet("BULATS_IA_PARSED")
x = 0  # current spreadsheet row, updated by finish()
Countries_List = ['Afghanistan','Armenia','Brazil','Argentina','Armenia','Australia','Austria','Azerbaijan','Bahrain','Bangladesh','Belgium','Belize','Bolivia','Bosnia and Herzegovina','Brazil','Brunei Darussalam','Bulgaria','Cameroon','Canada','Central African Republic','Chile','China','Colombia','Costa Rica','Croatia','Cuba','Cyprus','Czech Republic','Denmark','Dominican Republic','Ecuador','Egypt','Eritrea','Estonia','Ethiopia','Faroe Islands','Fiji','Finland','France','French Polynesia','Georgia','Germany','Gibraltar','Greece','Grenada','Hong Kong','Hungary','Iceland','India','Indonesia','Iran','Iraq','Ireland','Israel','Italy','Jamaica','Japan','Jordan','Kazakhstan','Kenya','Kuwait','Latvia','Lebanon','Libya','Liechtenstein','Lithuania','Luxembourg','Macau','Macedonia','Malaysia','Maldives','Malta','Mexico','Monaco','Montenegro','Morocco','Mozambique','Myanmar (Burma)','Nepal','Netherlands','New Caledonia','New Zealand','Nigeria','Norway','Oman','Pakistan','Palestine','Papua New Guinea','Paraguay','Peru','Philippines','Poland','Portugal','Qatar','Romania','Russia','Saudi Arabia','Serbia','Singapore','Slovakia','Slovenia','South Africa','South Korea','Spain','Sri Lanka','Sweden','Switzerland','Syria','Taiwan','Thailand','Trinadad and Tobago','Tunisia','Turkey','Ukraine','United Arab Emirates','United Kingdom','United States','Uruguay','Uzbekistan','Venezuela','Vietnam']
urls = ["http://www.cambridgeesol.org/institutions/results.php?region=%s&type=&BULATS=on" % Countries for Countries in Countries_List]


def finish(results):
    global x
    for result in results:
        print 'GOT PAGE', len(result), 'bytes'
        soup = BeautifulSoup(result)
        tableau = soup.findAll('table')
        try:
            # The results are expected in the fourth table of the page.
            rows = tableau[3].findAll('tr')
            print("Fetching")
            for tr in rows:
                cols = tr.findAll('td')
                y = 0
                x = x + 1
                # Write one spreadsheet row per table row, one cell per <td>.
                for td in cols:
                    ws.write(x, y, td.text)
                    y = y + 1
        except IndexError:
            print("No IA for this country")

    # Every page has been processed, so let reactor.run() return.
    reactor.stop()

waiting = [client.getPage(url) for url in urls]
defer.gatherResults(waiting).addCallback(finish)

reactor.run()
wb.save("IALOL.xls")
print "Elapsed Time: %s" % (time.time() - start)

Thank you very much in advance for your help!


Solution

  • You have two options: keep using getPage and tell it to POST instead of GET, or use Agent.

    The API documentation for getPage directs you to the API documentation for HTTPClientFactory to discover additional supported options.

    The latter API documentation explicitly covers method and implies (but does a bad job of explaining) postdata. So, to make a POST with getPage:

    d = getPage(url, method='POST', postdata="hello, world, or whatever.")
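
To connect this to the question, here is a minimal sketch of sending several form POSTs to one URL with getPage and gathering the responses. The helper names (build_form_request, post_form, post_all) are made up for illustration, and the real form field names for the target page are not known here, so they must be read from the page's HTML. It also assumes the server expects a URL-encoded form body, which is why the Content-Type header is set explicitly. Note that getPage is deprecated in newer Twisted releases.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2, as used in the question


def build_form_request(fields):
    """Return (postdata, headers) for a URL-encoded form POST.

    `fields` is a dict or a list of (name, value) pairs.
    """
    postdata = urlencode(fields).encode("ascii")
    headers = {b"Content-Type": b"application/x-www-form-urlencoded"}
    return postdata, headers


def post_form(url, fields):
    """POST `fields` to `url` (bytes); returns a Deferred firing with the body."""
    # Imported lazily so build_form_request can be used without Twisted.
    from twisted.web.client import getPage

    postdata, headers = build_form_request(fields)
    return getPage(url, method=b"POST", postdata=postdata, headers=headers)


def post_all(url, payloads):
    """Send one POST per payload; the Deferred fires with a list of bodies."""
    from twisted.internet import defer

    return defer.gatherResults([post_form(url, fields) for fields in payloads])
```

Because post_all returns the same kind of Deferred as defer.gatherResults(waiting) in the question, a callback such as finish can be attached to it unchanged before calling reactor.run().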
    

    There is a howto-style document for Agent (linked from the overall web howto documentation index). It gives examples of sending a request with a body (i.e., see the FileBodyProducer example).
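
For the Agent route, a rough sketch might look like the following. It assumes a URL-encoded form body and a Twisted version that provides readBody. The helper names (encode_form, post_form_with_agent) are hypothetical, not part of Twisted.

```python
from io import BytesIO

try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2


def encode_form(fields):
    """URL-encode form fields (dict or list of pairs) into a bytes body."""
    return urlencode(fields).encode("ascii")


def post_form_with_agent(url, fields):
    """POST `fields` to `url` (bytes); returns a Deferred firing with the body."""
    # Imported lazily so encode_form can be used without Twisted installed.
    from twisted.internet import reactor
    from twisted.web.client import Agent, FileBodyProducer, readBody
    from twisted.web.http_headers import Headers

    agent = Agent(reactor)
    # FileBodyProducer streams the request body from a file-like object.
    body = FileBodyProducer(BytesIO(encode_form(fields)))
    headers = Headers({b"Content-Type": [b"application/x-www-form-urlencoded"]})
    d = agent.request(b"POST", url, headers, body)
    # agent.request fires with a response object; readBody collects its body.
    d.addCallback(readBody)
    return d
```

As with getPage, the returned Deferreds can be collected with defer.gatherResults so that a single callback receives all response bodies.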