I want to write a script using Python and BeautifulSoup to pick out some text from a webpage and put it together neatly. The ideal result looks like this:
Port_new_cape Jan 23, 2009 12:05
Brisbane July 24, 2002 03:12
Liaoning Aug 26, 2006 02:55
Because the page is on a company website that requires authentication and redirection, I copied the source of the target page into a file and saved it as example.html in C:\ for convenience.
Part of the source is quoted below (these are the target rows, and there are many more similar ones):
<tr class="ghj">
<td><span class="city-sh"><sh src="./citys/1.jpg" alt="boy" title="boy" /></span><a href="./membercity.php?mode=view&u=12563">Port_new_cape</a></td>
<td class="position"><a href="./search.php?id=12563&sr=positions" title="Search positions">452</a></td>
<td class="details"><div>South</div></td>
<td>May 09, 1997</td>
<td>Jan 23, 2009 12:05 pm </td>
</tr>
<tr class="ghj">
<td><span class="city-sh"><sh src="./citys/1.jpg" alt="boy" title="boy" /></span><a href="./membercity.php?mode=view&u=12563">Brisbane</a></td>
<td class="position"><a href="./search.php?id=12563&sr=positions" title="Search positions">356</a></td>
<td class="details"><div>South</div></td>
<td>Jun 09, 1986</td>
<td>July 24, 2002 03:12 pm </td>
</tr>
<tr class="ghj">
<td><span class="city-sh"><sh src="./citys/1.jpg" alt="boy" title="boy" /></span><a href="./membercity.php?mode=view&u=12563">Liaoning</a></td>
<td class="position"><a href="./search.php?id=12563&sr=positions" title="Search positions">1105</a></td>
<td class="details"><div>Southeast</div></td>
<td>March 09, 2007</td>
<td>Aug 26, 2006 02:55 pm </td>
</tr>
This is what I have so far (part of the script comes from a gentleman's help):
from bs4 import BeautifulSoup
import re
import urllib2

url = r"C:\example.html"
page = open(url)
soup = BeautifulSoup(page.read())

#preparing the 1st list
cities = soup.find_all(href=re.compile("u="))
LST = []
for city in cities:
    ci = city.renderContents()
    full_list = LST.append(ci)

#preparing the 2nd list
Dates = soup.find_all('td', {'class' : 'info'})
LSTTT = []
counter = 0
while len(Dates) > counter:
    datesGroup = Dates[counter].find_next_siblings('td')
    if len(datesGroup) == 2:
        ti = datesGroup[1].renderContents()
        full_list_2 = LSTTT.append(ti)
    counter += 1

print full_list
print full_list_2
My idea is to collect the outputs into two lists and then combine their elements, which should correspond one to one. However, when I run the script, it prints None for both lists.
My questions:
Many thanks.
The list.append method returns None because it modifies the list in place. Instead of storing its result in other variables, you can use LST and LSTTT as they are, like this:
ci = city.renderContents()
LST.append(ci)
...
...
ti = datesGroup[1].renderContents()
LSTTT.append(ti)
...
...
print(zip(LST, LSTTT))
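To see why the original variables ended up as None, here is a minimal sketch of the same mistake in isolation. The names `names` and `result` are made up for illustration and do not appear in the question.

```python
# Hypothetical names, for illustration only.
names = []

# append() changes the list in place and returns None,
# so assigning its result captures None, not the list.
result = names.append("Brisbane")

print(result)  # None
print(names)   # ['Brisbane']
```

The list you appended to is the one that holds the data, so that is the one to print or pass along.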
In Python 2, the zip function returns a list of tuples pairing up the corresponding elements of all the input iterables.
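As a small illustration of that pairing, the sketch below zips two hand-written lists built from the values in the question. The names `cities`, `stamps` and `pairs` are assumptions for the example. It is written for Python 3, where zip returns a lazy iterator rather than a list, hence the list() call.

```python
# Sample values taken from the desired output in the question.
cities = ["Port_new_cape", "Brisbane", "Liaoning"]
stamps = ["Jan 23, 2009 12:05", "July 24, 2002 03:12", "Aug 26, 2006 02:55"]

# list() forces the Python 3 iterator into a concrete list of tuples.
pairs = list(zip(cities, stamps))

print(pairs[0])    # ('Port_new_cape', 'Jan 23, 2009 12:05')
print(len(pairs))  # 3
```

If the two lists differ in length, zip stops at the shorter one, so a mismatch between them silently drops entries.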
If you want to print the zipped result without the tuples, you can iterate over the pairs like this:
for item1, item2 in zip(LST, LSTTT):
    # Python 2, as in the question; under Python 3 write print(item1, item2)
    print item1, item2
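Putting the pieces together, here is one possible end-to-end sketch for Python 3, not necessarily how the original script should be structured. It makes several assumptions: the rows are inlined as a string instead of being read from C:\example.html, each row of interest is a `tr` with class `ghj`, the city is the first link in the first cell, and the wanted timestamp is the last cell. The quoted sample has no `td` with class `info`, so the sketch does not rely on that selector. Dropping the trailing am/pm is also an assumption, made to match the desired output.

```python
from bs4 import BeautifulSoup

# Rows copied from the sample in the question; in practice this would
# come from open(r"C:\example.html").read().
html = """
<table>
<tr class="ghj">
<td><span class="city-sh"><sh src="./citys/1.jpg" alt="boy" title="boy" /></span><a href="./membercity.php?mode=view&u=12563">Port_new_cape</a></td>
<td class="position"><a href="./search.php?id=12563&sr=positions" title="Search positions">452</a></td>
<td class="details"><div>South</div></td>
<td>May 09, 1997</td>
<td>Jan 23, 2009 12:05 pm </td>
</tr>
<tr class="ghj">
<td><span class="city-sh"><sh src="./citys/1.jpg" alt="boy" title="boy" /></span><a href="./membercity.php?mode=view&u=12563">Brisbane</a></td>
<td class="position"><a href="./search.php?id=12563&sr=positions" title="Search positions">356</a></td>
<td class="details"><div>South</div></td>
<td>Jun 09, 1986</td>
<td>July 24, 2002 03:12 pm </td>
</tr>
<tr class="ghj">
<td><span class="city-sh"><sh src="./citys/1.jpg" alt="boy" title="boy" /></span><a href="./membercity.php?mode=view&u=12563">Liaoning</a></td>
<td class="position"><a href="./search.php?id=12563&sr=positions" title="Search positions">1105</a></td>
<td class="details"><div>Southeast</div></td>
<td>March 09, 2007</td>
<td>Aug 26, 2006 02:55 pm </td>
</tr>
</table>
"""

# Name the parser explicitly so the result does not depend on which
# optional parsers happen to be installed.
soup = BeautifulSoup(html, "html.parser")

lines = []
for row in soup.find_all("tr", class_="ghj"):
    cells = row.find_all("td")
    city = cells[0].a.get_text(strip=True)    # first link in the first cell
    stamp = cells[-1].get_text(strip=True)    # last cell holds the timestamp
    # Assumption: the desired output omits the trailing am/pm marker.
    if stamp.lower().endswith((" am", " pm")):
        stamp = stamp[:-3]
    lines.append("{} {}".format(city, stamp))

print("\n".join(lines))
```

Walking the table row by row keeps each city and its timestamp together, so there is no need to build two separate lists and hope they line up.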