I'm a beginner with BeautifulSoup, and I am trying to extract the
Company Name, Rank, and Revenue from this Wikipedia page:
https://en.m.wikipedia.org/wiki/List_of_largest_Internet_companies
The code I've used so far is:
from bs4 import BeautifulSoup
import requests
url = "https://en.wikiepdia.org"
req = requests.get(url)
bsObj = BeautifulSoup(req.text, "html.parser")
data = bsObj.find('table',{'class':'wikitable sortable mw-collapsible'})
revenue=data.findAll('data-sort-value')
I realise that even 'data' is not working correctly, as it holds no values when I pass it to the Flask page.
Could someone please suggest a fix, the most elegant way to achieve the above, and some guidance on what to look for in the HTML when scraping (and in what format)?
On that page, https://en.m.wikipedia.org/wiki/List_of_largest_Internet_companies, I am not sure which element I am meant to extract from: the table class, a div class, or the body. I am also unsure how to go about extracting the link and revenue further down the tree.
I've also tried:
data = bsObj.find_all('table', class_='wikitable sortable mw-collapsible')
It runs the server with no errors. However, only an empty list ("[]") is displayed on the webpage.
Based on one answer below: I updated code to the below:
url = "https://en.wikiepdia.org"
req = requests.get(url)
bsObj = BeautifulSoup(req.text, "html.parser")
mydata=bsObj.find('table',{'class':'wikitable sortable mw-collapsible'})
table_data=[]
rows = mydata.findAll(name=None, attrs={}, recursive=True, text=None, limit=None, kwargs='')('tr')
for row in rows:
    cols=row.findAll('td')
    row_data=[ele.text.strip() for ele in cols]
    table_data.append(row_data)
data=table_data[0:10]
The persistent error is:
File "webscraper.py", line 15, in <module>
rows = mydata.findAll(name=None, attrs={}, recursive=True, text=None, limit=None, kwargs='')('tr')
AttributeError: 'NoneType' object has no attribute 'findAll'
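For context on that traceback: `find` returns `None` when no matching tag exists on the fetched page (and `https://en.wikiepdia.org` is not the article page), so calling `findAll` on the result raises `AttributeError`. A minimal standalone sketch (assuming only bs4, with a made-up HTML snippet) showing how this arises and how to guard against it:

```python
from bs4 import BeautifulSoup

# A page with no matching table, i.e. what you get from the wrong URL
html = "<html><body><p>not the article</p></body></html>"
bsObj = BeautifulSoup(html, "html.parser")

mydata = bsObj.find('table', {'class': 'wikitable sortable mw-collapsible'})
print(mydata)  # None: no such table on this page

if mydata is not None:  # guard before calling methods on the result
    rows = mydata.findAll('tr')
else:
    print("table not found; check the URL and the class string")
```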
Based on the answer below, it is now scraping the data, but not in the format asked for above:
I've got this:
url = 'https://en.m.wikipedia.org/wiki/List_of_largest_Internet_companies'
req = requests.get(url)
bsObj = BeautifulSoup(req.text, 'html.parser')
data = bsObj.find('table',{'class':'wikitable sortable mw-collapsible'})
table_data = []
rows = data.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    row_data = [ele.text.strip() for ele in cols]
    table_data.append(row_data)
# First element is the header row, so that is why it is empty
data=table_data[0:5]
for i in range(5):
    rank=data[i]
    name=data[i+1]
For completeness (and a full answer) I'd like it to display:
- The first five companies in the table
- The company name, the rank, and the revenue
Currently it displays this:
Wikipedia
[[], ['1', 'Amazon', '$280.5', '2019', '798,000', '$920.22', 'Seattle', '1994', '[1][2]'], ['2', 'Google', '$161.8', '2019', '118,899', '$921.14', 'Mountain View', '1998', '[3][4]'], ['3', 'JD.com', '$82.8', '2019', '220,000', '$51.51', 'Beijing', '1998', '[5][6]'], ['4', 'Facebook', '$70.69', '2019', '45,000', '$585.37', 'Menlo Park', '2004', '[7][8]']]
['1', 'Amazon', '$280.5', '2019', '798,000', '$920.22', 'Seattle', '1994', '[1][2]']
['2', 'Google', '$161.8', '2019', '118,899', '$921.14', 'Mountain View', '1998', '[3][4]']
Here's an example using BeautifulSoup. A lot of the following is based on the answer here https://stackoverflow.com/a/23377804/6873133.
from bs4 import BeautifulSoup
import requests
url = 'https://en.m.wikipedia.org/wiki/List_of_largest_Internet_companies'
req = requests.get(url)
bsObj = BeautifulSoup(req.text, 'html.parser')
data = bsObj.find('table',{'class':'wikitable sortable mw-collapsible'})
table_data = []
rows = data.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    row_data = [ele.text.strip() for ele in cols]
    table_data.append(row_data)
# First element is header so that is why it is empty
table_data[0:5]
# [[],
# ['1', 'Amazon', '$280.5', '2019', '798,000', '$920.22', 'Seattle', '1994', '[1][2]'],
# ['2', 'Google', '$161.8', '2019', '118,899', '$921.14', 'Mountain View', '1998', '[3][4]'],
# ['3', 'JD.com', '$82.8', '2019', '220,000', '$51.51', 'Beijing', '1998', '[5][6]'],
# ['4', 'Facebook', '$70.69', '2019', '45,000', '$585.37', 'Menlo Park', '2004', '[7][8]']]
To isolate certain elements of this list, you just need to be mindful of the numerical index of the inner list. Here, let's look at the first few values for Amazon.
# The entire row for Amazon
table_data[1]
# ['1', 'Amazon', '$280.5', '2019', '798,000', '$920.22', 'Seattle', '1994', '[1][2]']
# Rank
table_data[1][0]
# '1'
# Company
table_data[1][1]
# 'Amazon'
# Revenue
table_data[1][2]
# '$280.5'
So to isolate just the first three columns (rank, company, and revenue), you can run the following list comprehension.
iso_data = [tab[0:3] for tab in table_data]
iso_data[1:6]
# [['1', 'Amazon', '$280.5'], ['2', 'Google', '$161.8'], ['3', 'JD.com', '$82.8'], ['4', 'Facebook', '$70.69'], ['5', 'Alibaba', '$56.152']]
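And if the goal is literally the five display lines asked for in the question, you can loop over that slice and unpack each row. A small sketch, hardcoding the row format shown above rather than re-scraping:

```python
# Rows in the format produced above (header row already excluded)
iso_data = [
    ['1', 'Amazon', '$280.5'],
    ['2', 'Google', '$161.8'],
    ['3', 'JD.com', '$82.8'],
    ['4', 'Facebook', '$70.69'],
    ['5', 'Alibaba', '$56.152'],
]

for rank, name, revenue in iso_data:
    print(f"{rank}. {name}: {revenue}")
# 1. Amazon: $280.5
# 2. Google: $161.8
# ...
```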
Then, if you want to put it into a pandas data frame, you can do the following.
import pandas as pd
# The `1` here is important to remove the empty header
df = pd.DataFrame(table_data[1:], columns = ['Rank', 'Company', 'Revenue', 'F.Y.', 'Employees', 'Market cap', 'Headquarters', 'Founded', 'Refs'])
df
# Rank Company Revenue F.Y. Employees Market cap Headquarters Founded Refs
# 0 1 Amazon $280.5 2019 798,000 $920.22 Seattle 1994 [1][2]
# 1 2 Google $161.8 2019 118,899 $921.14 Mountain View 1998 [3][4]
# 2 3 JD.com $82.8 2019 220,000 $51.51 Beijing 1998 [5][6]
# 3 4 Facebook $70.69 2019 45,000 $585.37 Menlo Park 2004 [7][8]
# 4 5 Alibaba $56.152 2019 101,958 $570.95 Hangzhou 1999 [9][10]
# .. ... ... ... ... ... ... ... ... ...
# 75 77 Farfetch $1.02 2019 4,532 $3.51 London 2007 [138][139]
# 76 78 Yelp $1.01 2019 5,950 $2.48 San Francisco 1996 [140][141]
# 77 79 Vroom.com $1.1 2020 3,990 $5.2 New York City 2003 [142]
# 78 80 Craigslist $1.0 2018 1,000 - San Francisco 1995 [143]
# 79 81 DocuSign $1.0 2018 3,990 $10.62 San Francisco 2003 [144]
#
# [80 rows x 9 columns]
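If you then want the Revenue column as numbers (for sorting or plotting), you can strip the leading `$` and any thousands separators and cast to float. A sketch assuming the string format shown above, using a small hardcoded subset of the table:

```python
import pandas as pd

# Hypothetical subset of the scraped rows, in the format shown above
df = pd.DataFrame(
    [['1', 'Amazon', '$280.5'], ['2', 'Google', '$161.8']],
    columns=['Rank', 'Company', 'Revenue'],
)

# Remove the '$' and any commas, then convert the strings to floats
df['Revenue'] = (
    df['Revenue']
    .str.replace('$', '', regex=False)
    .str.replace(',', '', regex=False)
    .astype(float)
)
print(df['Revenue'].tolist())  # [280.5, 161.8]
```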