I scraped data from a website and I am having trouble cleaning it. This is the code I used to scrape the data. Is this best practice?
import requests
from bs4 import BeautifulSoup
import json

all_countries_links = []
countries = []
all_data = []
data_dict = {}
data_value = []
page1 = requests.get("https://data.un.org/")

def main(page):
    source = page.content
    soup = BeautifulSoup(source, 'lxml')
    # attrs must be a dict ({"class": ...}), not a set ({"class", ...})
    all_page = soup.find("div", {"class": "CountryList"}).find_all('a', href=True)
    for link in all_page:
        all_countries_links.append(link['href'])
        countries.append(link.text.strip())

def scrape_country(all_countries_links, countries):
    for country in all_countries_links[:2]:
        page2 = requests.get(f"https://data.un.org/{country}")
        source = page2.content
        soup = BeautifulSoup(source, 'lxml')
        all_page = soup.find('ul', {'class': 'pure-menu-list'})
        tables = all_page.contents
        for table in tables:
            line = table.text.strip()
            all_data.append(line)

main(page1)
scrape_country(all_countries_links, countries)
file_path = "data.json"
with open(file_path, 'w') as f:
    json.dump(all_data, f, indent=4)
print(f"Data saved to {file_path}")
This is a small example of the data after collecting it.
[
"",
"General Information\n\nRegion\u00a0\n\u00a0\nSouthern Asia\nPopulation\u00a0(000, 2021)\n\u00a0\n39 835a\nPop. density\u00a0(per km2, 2021)\n\u00a0\n61a\nCapital city\u00a0\n\u00a0\nKabul\nCapital city pop.\u00a0(000, 2021)\n\u00a0\n4 114.0b\nUN membership date\u00a0\n\u00a0\n19-Nov-46\nSurface area\u00a0(km2)\n\u00a0\n652 864b\nSex ratio\u00a0(m per 100 f)\n\u00a0\n105.3a\nNational currency\u00a0\n\u00a0\nAfghani (AFN)\nExchange rate\u00a0(per US$)\n\u00a0\n77.1c",
]
I tried to separate the data with this code:
cleaned_data = []
# for line in cleaned_data:
#     print(line.split('\n'))
# new_data = [line for line in all_data.split()]
for line in all_data[:1]:
    for line2 in line.split():
        if line2 not in ["General", "Information", "Economic", " indicators", "Social", " indicators"]:
            cleaned_data.append(line2)
But I was hoping to find a better way.
For this type of task I'd recommend the pandas.read_html() function:
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

country_url = "https://data.un.org/en/iso/af.html"
soup = BeautifulSoup(requests.get(country_url).content, "html.parser")

for table in soup.select("details table"):
    summary = table.find_previous("summary").text
    df = pd.read_html(StringIO(str(table)))[0]
    df["table_name"] = summary
    print(df)
    print("-" * 80)
Prints:
...
--------------------------------------------------------------------------------
                                               Unnamed: 0         2010         2015          2021           table_name
 0      GDP: Gross domestic product (million current US$)       14 699       18 713       17 877b  Economic indicators
 1         GDP growth rate (annual %, const. 2015 prices)          5.2         -1.4            4b  Economic indicators
 2                           GDP per capita (current US$)        503.6        543.8        469.9b  Economic indicators
 3          Economy: Agriculture (% of Gross Value Added)         33.2         27.3       26.9d,b  Economic indicators
 4             Economy: Industry (% of Gross Value Added)           13         10.8     12.8e,f,b  Economic indicators
 5        Economy: Services and other activity (% of GVA)         53.8         61.9       60.4g,b  Economic indicators
 6             Employment in agricultureh (% of employed)         54.7         47.1         42.4c  Economic indicators
 7                Employment in industryh (% of employed)         14.4           17         18.3c  Economic indicators
 8                   Employment in servicesh (% employed)         30.9         35.8         39.4c  Economic indicators
 9                      Unemploymenth (% of labour force)         11.5         11.4         11.2c  Economic indicators
10  Labour force participation rateh (female/male pop. %)  14.9 / 78.4  18.8 / 76.2  21.8 / 74.6c  Economic indicators
11                   CPI: Consumer Price Index (2010=100)          100          133          150b  Economic indicators
12          Agricultural production index (2014-2016=100)           93           96          111b  Economic indicators
13     International trade: exports (million current US$)          388          571      1 022h,c  Economic indicators
14     International trade: imports (million current US$)        5 154        7 723      9 683h,c  Economic indicators
15     International trade: balance (million current US$)      - 4 766      - 7 151    - 8 661h,c  Economic indicators
16     Balance of payments, current account (million US$)         -578      - 4 193      - 3 137c  Economic indicators
--------------------------------------------------------------------------------
                                                          Unnamed: 0          2010          2015           2021         table_name
 0                        Population growth ratei (average annual %)           2.6           3.3           2.5c  Social indicators
 1                          Urban population (% of total population)          23.7          24.8          25.8b  Social indicators
 2                  Urban population growth ratei (average annual %)           3.7             4            ...  Social indicators
 3                    Fertility rate, totali (live births per woman)           6.5           5.4           4.6c  Social indicators
 4                  Life expectancy at birthi (females/males, years)   61.0 / 58.3   63.8 / 60.9   65.8 / 62.8c  Social indicators
 5               Population age distribution (0-14/60+ years old, %)    48.2 / 3.9    44.9 / 4.0    41.2 / 4.3a  Social indicators
 6                International migrant stockj (000/% of total pop.)   102.3 / 0.4   339.4 / 1.0   144.1 / 0.4c  Social indicators
 7                     Refugees and others of concern to UNHCR (000)      1 200.0k       1 421.4       2 802.9c  Social indicators
 8                    Infant mortality ratei (per 1 000 live births)          72.2          60.1          51.7c  Social indicators
 9                            Health: Current expenditure (% of GDP)           8.6          10.1           9.4l  Social indicators
10                               Health: Physicians (per 1 000 pop.)           0.2           0.3           0.3m  Social indicators
11                      Education: Government expenditure (% of GDP)           3.5           3.3         4.1h,n  Social indicators
12          Education: Primary gross enrol. ratio (f/m per 100 pop.)  80.6 / 118.6  83.5 / 122.7  82.9 / 124.2l  Social indicators
13        Education: Secondary gross enrol. ratio (f/m per 100 pop.)   33.3 / 66.9   36.8 / 65.9   40.0 / 70.1l  Social indicators
14  Education: Upper secondary gross enrol. ratio (f/m per 100 pop.)   17.8 / 42.7   27.1 / 52.6   28.5 / 52.4l  Social indicators
15                      Intentional homicide rate (per 100 000 pop.)           3.4           9.8           6.7l  Social indicators
16                   Seats held by women in national parliaments (%)          27.3          27.7            27o  Social indicators
--------------------------------------------------------------------------------
...
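The cells in the output above are still strings: they carry footnote letters (17 877b, 1 022h,c) and use spaces as thousands separators. If you need numbers, here is a minimal sketch of one way to clean them. It assumes the footnote markers are always lowercase letters (optionally comma-separated) at the end of a value, which holds for the sample shown but may not hold for every country page. The helper names to_number and split_pair are my own, not part of pandas:

```python
import re

# Trailing footnote markers such as "b", "h,c" or "e,f,b" that directly
# follow a digit (assumed format, based on the sample output only).
FOOTNOTE = re.compile(r"(?<=\d)[a-z](?:,[a-z])*$")


def to_number(cell):
    """Convert a cell such as '17 877b' or '- 4 766' to a float, else None."""
    text = str(cell).replace("\u00a0", " ").strip()
    text = FOOTNOTE.sub("", text)   # drop the footnote letters
    text = text.replace(" ", "")    # drop thousands separators and "- 4" gaps
    try:
        return float(text)
    except ValueError:
        return None                 # e.g. "..." or a "x / y" pair


def split_pair(cell):
    """Split a paired cell such as '21.8 / 74.6c' into two floats."""
    left, sep, right = str(cell).partition("/")
    if not sep:
        return None                 # not a paired value
    return to_number(left), to_number(right)


if __name__ == "__main__":
    for raw in ["17 877b", "26.9d,b", "- 8 661h,c", "..."]:
        print(repr(raw), "->", to_number(raw))
    print(split_pair("21.8 / 74.6c"))
```

With the DataFrame from the loop above, this could be applied per column, for example df["2021"].map(to_number). The footnote letters fused onto the row labels (such as "Employment in agricultureh") are not handled here, because they cannot be told apart from the label text reliably.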