I now have more time (due to the pandemic), so I decided to learn Python and have been teaching myself.
I am trying to scrape tax rates from a site and can get almost everything I need. Below is a snippet of the code that comes out of my Soup variable as well as the relevant piece of Python.
My difficulty is that I am only finding the option tags whose data-alias is empty (""). However, as the markup below shows, some data-alias values are not empty (see UAE or Great Britain); they list alternative names for the country. I want the data-url and country name from these tags as well.
How do I write this so that I get both the empty and the non-empty tags? At the moment I am losing some required information.
Thanks, Seth
My code:
import requests
from bs4 import BeautifulSoup
import re
l=[]
r = requests.get("https://taxsummaries.pwc.com/")
c=r.content
soup = BeautifulSoup(c, "html.parser")
all = soup.find_all("option", {"data-alias":""})
Website Information:
<option data-alias="" data-id="c9ddd85e-f3dc-4661-a4cb-8101f4644871" data-url="https://taxsummaries.pwc.com:443/uganda">Uganda</option>
<option data-alias="" data-id="d21e8abe-784c-4617-a90e-5369b49a202f" data-url="https://taxsummaries.pwc.com:443/ukraine">Ukraine</option>
<option data-alias="UAE" data-id="9e3f5e7b-f110-47dd-95d8-3d8160466e4a" data-url="https://taxsummaries.pwc.com:443/united-arab-emirates">United Arab Emirates</option>
<option data-alias="Great Britain
UK
Britain
Whales
Northern Ireland
England" data-id="3c42b2a9-7ed6-4b19-821d-5d78ef6f2b5d" data-url="https://taxsummaries.pwc.com:443/united-kingdom">United Kingdom</option>
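You need to use {"data-alias": True}. Passing True matches every option tag that has a data-alias attribute, whatever its value, whereas "" only matches an empty value, which is why the tags with aliases were dropped. You can try it: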
import requests
from bs4 import BeautifulSoup
l=[]
r = requests.get("https://taxsummaries.pwc.com/")
c=r.content
soup = BeautifulSoup(c, "html.parser")
options = soup.find_all('option', {"data-alias":True})
for each in options:
    # each.text is the country name; each['data-url'] is the link to its page
    print("country_name : " + str(each.text), " data-url : " + str(each['data-url']))
Output will be:
country_name : Albania data-url : https://taxsummaries.pwc.com:443/albania
country_name : Algeria data-url : https://taxsummaries.pwc.com:443/algeria
country_name : Angola data-url : https://taxsummaries.pwc.com:443/angola
country_name : Argentina data-url : https://taxsummaries.pwc.com:443/argentina
country_name : Armenia data-url : https://taxsummaries.pwc.com:443/armenia
country_name : Australia data-url : https://taxsummaries.pwc.com:443/australia
country_name : Austria data-url : https://taxsummaries.pwc.com:443/austria
country_name : Azerbaijan data-url : https://taxsummaries.pwc.com:443/azerbaijan
country_name : Bahrain data-url : https://taxsummaries.pwc.com:443/bahrain
country_name : Barbados data-url : https://taxsummaries.pwc.com:443/barbados
country_name : Belarus data-url : https://taxsummaries.pwc.com:443/belarus
country_name : Belgium data-url : https://taxsummaries.pwc.com:443/belgium
country_name : Bermuda data-url : https://taxsummaries.pwc.com:443/bermuda
country_name : Bolivia data-url : https://taxsummaries.pwc.com:443/bolivia
country_name : Bosnia and Herzegovina data-url : https://taxsummaries.pwc.com:443/bosnia-and-herzegovina
country_name : Botswana data-url : https://taxsummaries.pwc.com:443/botswana
country_name : Brazil data-url : https://taxsummaries.pwc.com:443/brazil
country_name : Bulgaria data-url : https://taxsummaries.pwc.com:443/bulgaria
and so on ......
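To see the effect without hitting the site, here is a minimal offline sketch. The markup is a shortened copy of the snippet from the question, and the last option (a placeholder with no data-alias) is a hypothetical addition of mine to show what gets skipped; the live page may look different.

```python
from bs4 import BeautifulSoup

# Shortened sample based on the question's markup. The final <option> is a
# hypothetical placeholder (not from the site) that has no data-alias at all.
SAMPLE_HTML = """
<select>
<option data-alias="" data-url="https://taxsummaries.pwc.com:443/uganda">Uganda</option>
<option data-alias="UAE" data-url="https://taxsummaries.pwc.com:443/united-arab-emirates">United Arab Emirates</option>
<option data-alias="Great Britain
UK
England" data-url="https://taxsummaries.pwc.com:443/united-kingdom">United Kingdom</option>
<option value="">Select a territory</option>
</select>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# True keeps every <option> that carries a data-alias attribute,
# whether its value is empty or not.
options = soup.find_all("option", {"data-alias": True})

names = [opt.get_text(strip=True) for opt in options]
# One alias per line in the attribute; an empty attribute gives an empty list.
aliases = [opt["data-alias"].splitlines() for opt in options]

for name, alias_list in zip(names, aliases):
    print(name, alias_list)
```

Both the empty-alias and the non-empty-alias countries come back, and the placeholder without the attribute does not.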
To get the results as a list:
for each in options:
    # build one "name : url" string per country
    l.append(str(each.text) + " : " + str(each['data-url']))
print(l)
Output will be:
['Albania : https://taxsummaries.pwc.com:443/albania', 'Algeria : https://taxsummaries.pwc.com:443/algeria', 'Angola : https://taxsummaries.pwc.com:443/angola', 'Argentina : https://taxsummaries.pwc.com:443/argentina', 'Armenia : https://taxsummaries.pwc.com:443/armenia', 'Australia : https://taxsummaries.pwc.com:443/australia', 'Austria : https://taxsummaries.pwc.com:443/austria', 'Azerbaijan : https://taxsummaries.pwc.com:443/azerbaijan', 'Bahrain : https://taxsummaries.pwc.com:443/bahrain', 'Barbados : https://taxsummaries.pwc.com:443/barbados', 'Belarus : https://taxsummaries.pwc.com:443/belarus', 'Belgium : https://taxsummaries.pwc.com:443/belgium', 'Bermuda : https://taxsummaries.pwc.com:443/bermuda', 'Bolivia : https://taxsummaries.pwc.com:443/bolivia', 'Bosnia and Herzegovina : https://taxsummaries.pwc.com:443/bosnia-and-herzegovina', 'Botswana : https://taxsummaries.pwc.com:443/botswana', 'Brazil : https://taxsummaries.pwc.com:443/brazil', 'Bulgaria : https://taxsummaries.pwc.com:443/bulgaria', 'Cabo Verde : https://taxsummaries.pwc.com:443/cabo-verde', 'Cambodia : https://taxsummaries.pwc.com:443/cambodia', 'Cameroon, Republic of : https://taxsummaries.pwc.com:443/republic-of-cameroon',
and so on............]
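If you plan to look URLs up by country later, a dict may be easier to work with than a list of joined strings. This is only a sketch under my own assumptions: country_urls is a hypothetical helper name, and the sample markup is copied from the question rather than fetched from the site.

```python
from bs4 import BeautifulSoup


def country_urls(html):
    """Map country name -> data-url for every <option> that has a data-alias.

    country_urls is a hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(html, "html.parser")
    return {
        opt.get_text(strip=True): opt["data-url"]
        for opt in soup.find_all("option", {"data-alias": True})
        if opt.has_attr("data-url")  # skip options that carry no link
    }


# Two options taken from the question's markup.
sample = (
    '<option data-alias="" data-url="https://taxsummaries.pwc.com:443/ukraine">Ukraine</option>'
    '<option data-alias="UAE" data-url="https://taxsummaries.pwc.com:443/united-arab-emirates">'
    "United Arab Emirates</option>"
)

urls = country_urls(sample)
print(urls["Ukraine"])
```

With the real page you would pass r.content instead of sample.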