Tags: python, web, screen-scraping

Wiki scraping using Python


I am trying to scrape the data stored in the table on this Wikipedia page: https://en.wikipedia.org/wiki/Minister_of_Agriculture_(India). However, I am unable to scrape the full data. Here's what I have written so far:

from bs4 import BeautifulSoup
import urllib2
wiki = "https://en.wikipedia.org/wiki/Minister_of_Agriculture_(India)"
header = {'User-Agent': 'Mozilla/5.0'} #Needed to prevent 403 error on Wikipedia
req = urllib2.Request(wiki,headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page,"html.parser")

name = ""
pic = ""
strt = ""
end = ""
pri = ""
x=""
table = soup.find("table", { "class" : "wikitable" })
for row in table.findAll("tr"):
    cells = row.findAll("td")

    if len(cells) == 8:
        name = cells[0].find(text=True)
        print name

The output obtained is: Jairamdas Daulatram, Surjit Singh Barnala, Rao Birendra Singh

Whereas the output should be: Jairamdas Daulatram followed by Panjabrao Deshmukh


Solution

  • Have you read the raw HTML?

    Because some of the cells span several rows (e.g. Political Party), most rows do not have 8 cells in them.

    You therefore cannot test if len(cells) == 8 and expect it to work. Think about what that line was meant to achieve. If it was meant to skip the header row, you could replace it with if len(cells) > 0, because all the header cells are <th> tags and therefore never appear in the list returned by row.findAll("td").

    Page source (showing your problem):

      <tr>
        <td><a href="/wiki/Jairamdas_Daulatram" title="Jairamdas Daulatram">Jairamdas Daulatram</a></td>
        <td></td>
        <td>1948</td>
        <td>1952</td>
        <td rowspan="6"><a href="/wiki/Indian_National_Congress" title="Indian National Congress">Indian National Congress</a></td>
        <td rowspan="6" bgcolor="#00BFFF" width="4px"></td>
        <td rowspan="3"><a href="/wiki/Jawaharlal_Nehru" title="Jawaharlal Nehru">Jawaharlal Nehru</a></td>
        <td><sup id="cite_ref-1" class="reference"><a href="#cite_note-1"><span>[</span>1<span>]</span></a></sup></td>
      </tr>
      <tr>
        <td><a href="/wiki/Panjabrao_Deshmukh" title="Panjabrao Deshmukh">Panjabrao Deshmukh</a></td>
        <td></td>
        <td>1952</td>
        <td>1962</td>
        <td><sup id="cite_ref-2" class="reference"><a href="#cite_note-2"><span>[</span>2<span>]</span></a></sup></td>
      </tr>
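
A minimal sketch of the suggested change, run against a trimmed copy of the two rows quoted above so that it works offline. Several things here are assumptions and not part of the original code: it is written for Python 3 (the question uses Python 2), it uses the standard library's html.parser instead of BeautifulSoup so that no third-party package is needed, the header row in SAMPLE is a hypothetical stand-in for the real one, and RowCollector is an illustrative helper name. With BeautifulSoup, the equivalent change is just swapping the length test inside the existing loop.

```python
from html.parser import HTMLParser

# Trimmed copy of the two rows quoted above. The <th> header row is a
# hypothetical stand-in for the real header on the Wikipedia page.
SAMPLE = """
<table class="wikitable">
<tr><th>Name</th><th>Portrait</th><th>Term start</th><th>Term end</th></tr>
<tr>
<td><a href="/wiki/Jairamdas_Daulatram">Jairamdas Daulatram</a></td>
<td></td><td>1948</td><td>1952</td>
<td rowspan="6"><a href="/wiki/Indian_National_Congress">Indian National Congress</a></td>
<td rowspan="6"></td>
<td rowspan="3"><a href="/wiki/Jawaharlal_Nehru">Jawaharlal Nehru</a></td>
<td>[1]</td>
</tr>
<tr>
<td><a href="/wiki/Panjabrao_Deshmukh">Panjabrao Deshmukh</a></td>
<td></td><td>1952</td><td>1962</td>
<td>[2]</td>
</tr>
</table>
"""


class RowCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # one list of cell texts per <tr>
        self._row = None   # cells of the row being parsed, if any
        self._cell = None  # text fragments of the <td> being parsed, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []
        # <th> is deliberately ignored, mirroring row.findAll("td")

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


parser = RowCollector()
parser.feed(SAMPLE)

# The rows hold 0, 8 and 5 <td> cells, so "== 8" would keep only the first
# data row, while "> 0" skips the header and keeps both data rows.
names = [cells[0] for cells in parser.rows if len(cells) > 0]
print(names)  # ['Jairamdas Daulatram', 'Panjabrao Deshmukh']
```

Note that this only recovers the first column reliably. Because of the rowspan cells, the position of later columns shifts from row to row, so reading something like the party or the prime minister by index would need extra handling.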