Hi guys!
I'm still discovering Twisted, and I've written this script to parse the content of HTML tables into Excel. The script works well! My question is: how can I do the same for a single web page (http://bandscore.ielts.org/), but with many POST requests, so that I can fetch all the results, parse them with BeautifulSoup, and then put them into Excel?
Parsing the source and putting it into excel is O.K, but I don't know how to do a POST request with Twisted in order to implement that in
This is the script I use to parse many different pages with Twisted. I want to write the same kind of script, but sending many different POST payloads to the same page instead of requesting many pages:
from twisted.web import client
from twisted.internet import reactor, defer
from bs4 import BeautifulSoup as BeautifulSoup
import time
import xlwt
start = time.time()
wb = xlwt.Workbook(encoding='utf-8')
ws = wb.add_sheet("BULATS_IA_PARSED")
global x
x = 0
Countries_List = ['Afghanistan','Armenia','Brazil','Argentina','Armenia','Australia','Austria','Azerbaijan','Bahrain','Bangladesh','Belgium','Belize','Bolivia','Bosnia and Herzegovina','Brazil','Brunei Darussalam','Bulgaria','Cameroon','Canada','Central African Republic','Chile','China','Colombia','Costa Rica','Croatia','Cuba','Cyprus','Czech Republic','Denmark','Dominican Republic','Ecuador','Egypt','Eritrea','Estonia','Ethiopia','Faroe Islands','Fiji','Finland','France','French Polynesia','Georgia','Germany','Gibraltar','Greece','Grenada','Hong Kong','Hungary','Iceland','India','Indonesia','Iran','Iraq','Ireland','Israel','Italy','Jamaica','Japan','Jordan','Kazakhstan','Kenya','Kuwait','Latvia','Lebanon','Libya','Liechtenstein','Lithuania','Luxembourg','Macau','Macedonia','Malaysia','Maldives','Malta','Mexico','Monaco','Montenegro','Morocco','Mozambique','Myanmar (Burma)','Nepal','Netherlands','New Caledonia','New Zealand','Nigeria','Norway','Oman','Pakistan','Palestine','Papua New Guinea','Paraguay','Peru','Philippines','Poland','Portugal','Qatar','Romania','Russia','Saudi Arabia','Serbia','Singapore','Slovakia','Slovenia','South Africa','South Korea','Spain','Sri Lanka','Sweden','Switzerland','Syria','Taiwan','Thailand','Trinadad and Tobago','Tunisia','Turkey','Ukraine','United Arab Emirates','United Kingdom','United States','Uruguay','Uzbekistan','Venezuela','Vietnam']
urls = ["http://www.cambridgeesol.org/institutions/results.php?region=%s&type=&BULATS=on" % Countries for Countries in Countries_List]
def finish(results):
    global x
    for result in results:
        print 'GOT PAGE', len(result), 'bytes'
        soup = BeautifulSoup(result)
        tableau = soup.findAll('table')
        try:
            rows = tableau[3].findAll('tr')
            print("Fetching")
            for tr in rows:
                cols = tr.findAll('td')
                y = 0
                x = x + 1
                for td in cols:
                    texte_bu = td.text
                    texte_bu = texte_bu.encode('utf-8')
                    #print("Writing...")
                    #print texte_bu
                    ws.write(x,y,td.text)
                    y = y + 1
        except(IndexError):
            print("No IA for this country")
            pass
    reactor.stop()
waiting = [client.getPage(url) for url in urls]
defer.gatherResults(waiting).addCallback(finish)
reactor.run()
wb.save("IALOL.xls")
print "Elapsed Time: %s" % (time.time() - start)
Thank you very much in advance for your help!
You have two options: keep using getPage and tell it to POST instead of GET, or use Agent.
The API documentation for getPage directs you to the API documentation for HTTPClientFactory to discover additional supported options. The latter explicitly covers method and implies (but does a bad job of explaining) postdata. So, to make a POST with getPage:
d = getPage(url, method='POST', postdata="hello, world, or whatever.")
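For the many-POSTs case in the question, one possible sketch is to build one form body per request, pass each to getPage, and gather the results as the original script already does. The field name country is hypothetical (the real names have to be read from the form on the target page), and build_post_bodies and fetch_all are illustrative helpers, not Twisted APIs. The Content-Type header is set explicitly because, as far as I know, getPage does not add one for you. getPage is deprecated in later Twisted releases, so this assumes a version that still ships it.

```python
try:
    from urllib import urlencode          # Python 2, as in the question's script
except ImportError:
    from urllib.parse import urlencode    # Python 3


def build_post_bodies(field_sets):
    """Turn a list of {field: value} dicts into url-encoded POST bodies (bytes)."""
    return [urlencode(fields).encode('ascii') for fields in field_sets]


def fetch_all(url, bodies):
    """Send one POST per body to the same URL; returns a Deferred of all pages.

    Twisted is imported here so build_post_bodies can be used on its own.
    Assumes a Twisted version that still provides getPage.
    """
    from twisted.internet import defer
    from twisted.web.client import getPage

    headers = {b'Content-Type': b'application/x-www-form-urlencoded'}
    waiting = [
        getPage(url.encode('ascii'), method=b'POST', postdata=body,
                headers=headers)
        for body in bodies
    ]
    return defer.gatherResults(waiting)


# Hypothetical field name; replace it with the names used by the real form.
example_bodies = build_post_bodies(
    [{'country': 'South Korea'}, {'country': 'Peru'}])
```

With these helpers, the tail of the original script would become something like fetch_all(url, bodies).addCallback(finish) followed by reactor.run(), with finish left unchanged.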
There is a howto-style document for Agent (linked from the overall web howto documentation index). It gives examples of sending a request with a body (i.e., see the FileBodyProducer example).
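As a rough illustration of the Agent route, the sketch below wraps a url-encoded form body in a FileBodyProducer and sends it with Agent.request. The helpers encode_form and post_form, and the field name country, are assumptions made for this example; only Agent, FileBodyProducer, Headers and readBody come from Twisted.

```python
from io import BytesIO

try:
    from urllib import urlencode          # Python 2
except ImportError:
    from urllib.parse import urlencode    # Python 3


def encode_form(fields):
    """Url-encode a {field: value} dict into a bytes POST body."""
    return urlencode(fields).encode('ascii')


def post_form(agent, url, fields):
    """POST a form with a twisted.web.client.Agent.

    Returns a Deferred that fires with the response body.
    Twisted is imported here so encode_form can be used on its own.
    """
    from twisted.web.client import FileBodyProducer, readBody
    from twisted.web.http_headers import Headers

    body = FileBodyProducer(BytesIO(encode_form(fields)))
    headers = Headers(
        {b'Content-Type': [b'application/x-www-form-urlencoded']})
    d = agent.request(b'POST', url.encode('ascii'), headers, body)
    d.addCallback(readBody)   # collect the whole response body
    return d


# Hypothetical field name; the real form may use different ones.
sample_body = encode_form({'country': 'New Zealand'})
```

A caller would create the agent once with Agent(reactor), call post_form for each set of form values, and combine the returned Deferreds with defer.gatherResults just as the original script does for its GET requests.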