(Using Python) Putting outputs into two lists and pairing the elements of the two lists


I want to write a script, using Python and BeautifulSoup, that picks up some text from a web page and puts it together neatly. The ideal results look like this:

Port_new_cape Jan 23, 2009 12:05
Brisbane July 24, 2002 03:12
Liaoning Aug 26, 2006 02:55

Because the page is on a company website that requires authentication and redirection, I copied the source of the target page into a file and saved it as example.html in C:\ for convenience.

Part of the source is quoted below (these are the target rows; there are many more similar rows):

<tr class="ghj">
    <td><span class="city-sh"><sh src="./citys/1.jpg" alt="boy" title="boy" /></span><a href="./membercity.php?mode=view&amp;u=12563">Port_new_cape</a></td>
    <td class="position"><a href="./search.php?id=12563&amp;sr=positions" title="Search positions">452</a></td>
    <td class="details"><div>South</div></td>
    <td>May 09, 1997</td>
    <td>Jan 23, 2009 12:05 pm&nbsp;</td>
</tr>

<tr class="ghj">
    <td><span class="city-sh"><sh src="./citys/1.jpg" alt="boy" title="boy" /></span><a href="./membercity.php?mode=view&amp;u=12563">Brisbane</a></td>
    <td class="position"><a href="./search.php?id=12563&amp;sr=positions" title="Search positions">356</a></td>
    <td class="details"><div>South</div></td>
    <td>Jun 09, 1986</td>
    <td>July 24, 2002 03:12 pm&nbsp;</td>
</tr>

<tr class="ghj">
    <td><span class="city-sh"><sh src="./citys/1.jpg" alt="boy" title="boy" /></span><a href="./membercity.php?mode=view&amp;u=12563">Liaoning</a></td>
    <td class="position"><a href="./search.php?id=12563&amp;sr=positions" title="Search positions">1105</a></td>
    <td class="details"><div>Southeast</div></td>
    <td>March 09, 2007</td>
    <td>Aug 26, 2006 02:55 pm&nbsp;</td>
</tr>

Below is what I have so far (part of the script is thanks to a gentleman's help):

from bs4 import BeautifulSoup
import re
import urllib2

url = r"C:\example.html"
page = open(url)
soup = BeautifulSoup(page.read())


#preparing the 1st list
cities = soup.find_all(href=re.compile("u="))

LST = []

for city in cities:
    ci = city.renderContents()
    full_list = LST.append(ci)


#preparing the 2nd list

Dates = soup.find_all('td', {'class' : 'info'})

LSTTT = []

counter = 0

while len(Dates) > counter:
    datesGroup = Dates[counter].find_next_siblings('td')
    if len(datesGroup) == 2:
        ti = datesGroup[1].renderContents()
        full_list_2 = LSTTT.append(ti)

    counter += 1


print full_list

print full_list_2

My idea is to put all the outputs into two lists and then combine the elements of the two lists, which should correspond one to one. However, when I run the script, it prints None for both lists.

My questions:

  1. What went wrong with the lists? Why are they None?
  2. How do I combine the elements of the two lists once they are populated correctly?

Many thanks.


Solution

  • The list.append method returns None because it modifies the list in place. Instead of storing its result in other variables, use LST and LSTTT directly, like this:

    ci = city.renderContents()
    LST.append(ci)
    ...
    ...
        ti = datesGroup[1].renderContents()
        LSTTT.append(ti)
    ...
    ...
    print(zip(LST, LSTTT))
    

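    In Python 2, which the question's code uses, the zip function returns a list of tuples of corresponding elements from all the input iterables. In Python 3 it returns a lazy iterator instead, so wrap it in list() before printing.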

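    If you want to print the zipped result without the tuples, you can iterate over the pairs, like this (under Python 2 without print_function, write print item1, item2 instead, because print(item1, item2) would itself print a tuple):

    for item1, item2 in zip(LST, LSTTT):
        print(item1, item2)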
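
To see both points in isolation, here is a minimal sketch in Python 3 that uses no scraping at all. The city and date strings are copied from the desired output in the question; the names cities, dates, result, pairs and lines are made up for this illustration and stand in for LST and LSTTT.

```python
# Sample values copied from the question's desired output.
cities = []
dates = []

# list.append changes the list in place and returns None,
# so storing its return value gives you None, not the list.
result = cities.append("Port_new_cape")
print(result)

cities.append("Brisbane")
cities.append("Liaoning")
dates.extend([
    "Jan 23, 2009 12:05",
    "July 24, 2002 03:12",
    "Aug 26, 2006 02:55",
])

# zip pairs the i-th city with the i-th date. In Python 3 it is lazy,
# so list() is used here to materialize the pairs.
pairs = list(zip(cities, dates))

# Join each pair into one line of text, as in the desired output.
lines = ["{} {}".format(city, date) for city, date in pairs]
for line in lines:
    print(line)
```

Note that zip stops at the shorter input, so if one list ends up with fewer entries than the other (for example because some rows were skipped while building the second list), the extra entries are silently dropped. It is worth comparing len() of the two lists before pairing them.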