
Beautiful Soup and scraping Wikipedia entries:


As a beginner to BeautifulSoup, I am trying to extract the Company Name, Rank, and Revenue from this Wikipedia link:

https://en.m.wikipedia.org/wiki/List_of_largest_Internet_companies

The code I've used so far is:

from bs4 import BeautifulSoup 
import requests 
url = "https://en.wikiepdia.org" 
req = requests.get(url) 
bsObj = BeautifulSoup(req.text, "html.parser") 
data = bsObj.find('table',{'class':'wikitable sortable mw-collapsible'})
revenue=data.findAll('data-sort-value')

I realise that even `data` is not working correctly, as it holds no values when I pass it to the Flask website.

Could someone please suggest a fix and the most elegant way to achieve the above, as well as the best methodology for deciding what to look for in the HTML when scraping (and the format to extract it in)?

On this link, https://en.m.wikipedia.org/wiki/List_of_largest_Internet_companies, I am not sure what I should target for the extraction - the table class, a div class, or the body - and, further, how to go about extracting the link and the revenue further down the tree.

I've also tried:

data = bsObj.find_all('table', class_='wikitable sortable mw-collapsible')

It runs the server with no errors; however, the webpage only displays an empty list: "[]".

Based on one answer below, I updated the code to:

url = "https://en.wikiepdia.org" 
req = requests.get(url) 
bsObj = BeautifulSoup(req.text, "html.parser") 
mydata=bsObj.find('table',{'class':'wikitable sortable mw-collapsible'})
table_data=[]
rows = mydata.findAll(name=None, attrs={}, recursive=True, text=None, limit=None, kwargs='')('tr')
for row in rows:
    cols=row.findAll('td')
    row_data=[ele.text.strip() for ele in cols]
    table_data.append(row_data)

data=table_data[0:10]

The persistent error is:

 File "webscraper.py", line 15, in <module>
    rows = mydata.findAll(name=None, attrs={}, recursive=True, text=None, limit=None, kwargs='')('tr')
AttributeError: 'NoneType' object has no attribute 'findAll'
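As I understand it now, this AttributeError means that `find` returned `None` - no matching table exists on the fetched page, because the URL above is still misspelled (`wikiepdia`) and points at the site root rather than the article. A minimal sketch of the failure mode and a guard for it (using a hard-coded HTML snippet rather than the live page):

```python
from bs4 import BeautifulSoup

# Reproduce the failure locally: a page with no matching table,
# like the misspelled site root the snippet above actually fetched
html_without_table = "<html><body><p>Not the article</p></body></html>"
soup = BeautifulSoup(html_without_table, 'html.parser')

table = soup.find('table', {'class': 'wikitable sortable mw-collapsible'})
print(table is None)  # True: find() returns None when nothing matches

# Calling table.findAll('tr') here would raise
# AttributeError: 'NoneType' object has no attribute 'findAll',
# so guard before calling methods on the result
rows = table.find_all('tr') if table is not None else []
print(rows)  # []
```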

Based on the answer below, it is now scraping the data, but not in the format asked for above:

I've got this:

url = 'https://en.m.wikipedia.org/wiki/List_of_largest_Internet_companies' 
req = requests.get(url) 
bsObj = BeautifulSoup(req.text, 'html.parser')
data = bsObj.find('table',{'class':'wikitable sortable mw-collapsible'})

table_data = []
rows = data.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    row_data = [ele.text.strip() for ele in cols]
    table_data.append(row_data)

# First element is header so that is why it is empty
data=table_data[0:5]

for i in range(5):
    rank=data[i]
    name=data[i+1]

For completeness (and a full answer) I'd like it to display:

- the first five companies in the table
- the company name, the rank, and the revenue

Currently it displays this:

Wikipedia

[[], ['1', 'Amazon', '$280.5', '2019', '798,000', '$920.22', 'Seattle', '1994', '[1][2]'], ['2', 'Google', '$161.8', '2019', '118,899', '$921.14', 'Mountain View', '1998', '[3][4]'], ['3', 'JD.com', '$82.8', '2019', '220,000', '$51.51', 'Beijing', '1998', '[5][6]'], ['4', 'Facebook', '$70.69', '2019', '45,000', '$585.37', 'Menlo Park', '2004', '[7][8]']]

['1', 'Amazon', '$280.5', '2019', '798,000', '$920.22', 'Seattle', '1994', '[1][2]']

['2', 'Google', '$161.8', '2019', '118,899', '$921.14', 'Mountain View', '1998', '[3][4]']


Solution

  • Here's an example using BeautifulSoup. A lot of the following is based on this answer: https://stackoverflow.com/a/23377804/6873133.

    from bs4 import BeautifulSoup 
    import requests
    
    url = 'https://en.m.wikipedia.org/wiki/List_of_largest_Internet_companies' 
    req = requests.get(url) 
    
    bsObj = BeautifulSoup(req.text, 'html.parser')
    data = bsObj.find('table',{'class':'wikitable sortable mw-collapsible'})
    
    table_data = []
    rows = data.find_all('tr')
    for row in rows:
        cols = row.find_all('td')
        row_data = [ele.text.strip() for ele in cols]
        table_data.append(row_data)
    
    # The first element is the header row, so it is empty
    table_data[0:5]
    # [[],
    #  ['1', 'Amazon', '$280.5', '2019', '798,000', '$920.22', 'Seattle', '1994', '[1][2]'],
    #  ['2', 'Google', '$161.8', '2019', '118,899', '$921.14', 'Mountain View', '1998', '[3][4]'],
    #  ['3', 'JD.com', '$82.8', '2019', '220,000', '$51.51', 'Beijing', '1998', '[5][6]'],
    #  ['4', 'Facebook', '$70.69', '2019', '45,000', '$585.37', 'Menlo Park', '2004', '[7][8]']]
    

    To isolate certain elements of this list, you just need to be mindful of the numerical index of the inner list. Here, let's look at the first few values for Amazon.

    # The entire row for Amazon
    table_data[1]
    # ['1', 'Amazon', '$280.5', '2019', '798,000', '$920.22', 'Seattle', '1994', '[1][2]']
    
    # Rank
    table_data[1][0]
    # '1'
    
    # Company
    table_data[1][1]
    # 'Amazon'
    
    # Revenue
    table_data[1][2]
    # '$280.5'
    

    So to isolate just the first three columns (rank, company, and revenue), you can run the following list comprehension.

    iso_data = [tab[0:3] for tab in table_data]
    
    iso_data[1:6]
    # [['1', 'Amazon', '$280.5'], ['2', 'Google', '$161.8'], ['3', 'JD.com', '$82.8'], ['4', 'Facebook', '$70.69'], ['5', 'Alibaba', '$56.152']]
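
If you want to print exactly what the question asks for - the first five companies with their rank, name, and revenue - a small sketch (with the rows hard-coded from the output above; in the real script `table_data` comes from the scraping loop) is:

```python
# Sample rows copied from the scraped output above; in the real
# script, table_data is built by the BeautifulSoup loop
table_data = [
    [],  # header row parsed into an empty list
    ['1', 'Amazon', '$280.5', '2019', '798,000', '$920.22', 'Seattle', '1994', '[1][2]'],
    ['2', 'Google', '$161.8', '2019', '118,899', '$921.14', 'Mountain View', '1998', '[3][4]'],
    ['3', 'JD.com', '$82.8', '2019', '220,000', '$51.51', 'Beijing', '1998', '[5][6]'],
    ['4', 'Facebook', '$70.69', '2019', '45,000', '$585.37', 'Menlo Park', '2004', '[7][8]'],
    ['5', 'Alibaba', '$56.152', '2019', '101,958', '$570.95', 'Hangzhou', '1999', '[9][10]'],
]

# Keep only rank, name, revenue; [1:6] skips the empty header row
lines = [f"{row[0]}. {row[1]}: {row[2]}" for row in table_data[1:6]]
print("\n".join(lines))
# 1. Amazon: $280.5
# 2. Google: $161.8
# ...
```

From there it is straightforward to pass these formatted strings to the Flask template instead of the raw nested lists.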
    

    Then, if you want to put it into a pandas data frame, you can do the following.

    import pandas as pd
    
    # The `1` here is important to remove the empty header
    df = pd.DataFrame(table_data[1:], columns = ['Rank', 'Company', 'Revenue', 'F.Y.', 'Employees', 'Market cap', 'Headquarters', 'Founded', 'Refs'])
    
    df
    #    Rank     Company  Revenue  F.Y. Employees Market cap   Headquarters Founded        Refs
    # 0     1      Amazon   $280.5  2019   798,000    $920.22        Seattle    1994      [1][2]
    # 1     2      Google   $161.8  2019   118,899    $921.14  Mountain View    1998      [3][4]
    # 2     3      JD.com    $82.8  2019   220,000     $51.51        Beijing    1998      [5][6]
    # 3     4    Facebook   $70.69  2019    45,000    $585.37     Menlo Park    2004      [7][8]
    # 4     5     Alibaba  $56.152  2019   101,958    $570.95       Hangzhou    1999     [9][10]
    # ..  ...         ...      ...   ...       ...        ...            ...     ...         ...
    # 75   77    Farfetch    $1.02  2019     4,532      $3.51         London    2007  [138][139]
    # 76   78        Yelp    $1.01  2019     5,950      $2.48  San Francisco    1996  [140][141]
    # 77   79   Vroom.com     $1.1  2020     3,990       $5.2  New York City    2003       [142]
    # 78   80  Craigslist     $1.0  2018     1,000          -  San Francisco    1995       [143]
    # 79   81    DocuSign     $1.0  2018     3,990     $10.62  San Francisco    2003       [144]
    # 
    # [80 rows x 9 columns]
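
Since only rank, company, and revenue were asked for, the data frame can also be subset by column. A small self-contained sketch (a few rows hard-coded from the table above; in practice `table_data[1:]` feeds the constructor as shown earlier):

```python
import pandas as pd

# A few rows from the scraped table; in practice use table_data[1:]
rows = [
    ['1', 'Amazon', '$280.5', '2019', '798,000', '$920.22', 'Seattle', '1994', '[1][2]'],
    ['2', 'Google', '$161.8', '2019', '118,899', '$921.14', 'Mountain View', '1998', '[3][4]'],
    ['3', 'JD.com', '$82.8', '2019', '220,000', '$51.51', 'Beijing', '1998', '[5][6]'],
]
cols = ['Rank', 'Company', 'Revenue', 'F.Y.', 'Employees', 'Market cap',
        'Headquarters', 'Founded', 'Refs']
df = pd.DataFrame(rows, columns=cols)

# Just the three columns the question asks about
subset = df[['Rank', 'Company', 'Revenue']]
print(subset)
```

On the full frame, `df[['Rank', 'Company', 'Revenue']].head(5)` gives the first five companies directly.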