
Python -- Beautiful Soup -- Returning Information if a tag is empty or has values


I have more free time because of the pandemic, so I have been teaching myself Python.

I am trying to scrape tax rates from a site and can get almost everything I need. Below is a snippet of the code that comes out of my Soup variable as well as the relevant piece of Python.

My difficulty is that my search only finds the option tags whose data-alias attribute is empty (""). However, as the markup below shows, some data-alias attributes are not empty (see UAE or Great Britain); they list alternative names for the country.

I am looking to get the data-url and country name from these as well.

How do I write this so that it returns the option tags with an empty data-alias as well as those with a non-empty one? At the moment I am losing some of the information I need.

Thanks, Seth

My code:

import requests
from bs4 import BeautifulSoup
import re

l=[]
r = requests.get("https://taxsummaries.pwc.com/")
c=r.content
soup = BeautifulSoup(c, "html.parser")
all = soup.find_all("option", {"data-alias":""})
    

Website Information:

<option data-alias="" data-id="c9ddd85e-f3dc-4661-a4cb-8101f4644871" data-url="https://taxsummaries.pwc.com:443/uganda">Uganda</option>
<option data-alias="" data-id="d21e8abe-784c-4617-a90e-5369b49a202f" data-url="https://taxsummaries.pwc.com:443/ukraine">Ukraine</option>
<option data-alias="UAE" data-id="9e3f5e7b-f110-47dd-95d8-3d8160466e4a" data-url="https://taxsummaries.pwc.com:443/united-arab-emirates">United Arab Emirates</option>
<option data-alias="Great Britain
UK
Britain
Whales
Northern Ireland
England" data-id="3c42b2a9-7ed6-4b19-821d-5d78ef6f2b5d" data-url="https://taxsummaries.pwc.com:443/united-kingdom">United Kingdom</option>


Solution

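  • You need to use {"data-alias":True}. Passing "" as the value does not match options whose data-alias holds text such as "UAE", whereas True matches every option that has a data-alias attribute, whatever its value. You can try it: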

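    import requests
    from bs4 import BeautifulSoup

    l = []  # filled in by the list example further down
    r = requests.get("https://taxsummaries.pwc.com/")
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    # True matches every <option> that has a data-alias attribute, empty or not
    options = soup.find_all("option", {"data-alias": True})
    for each in options:
        # each.text is the country name; each["data-url"] is the link to its page
        print("country_name : " + each.text, " data-url : " + each["data-url"])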
    

    Output will be:

    country_name : Albania  data-url : https://taxsummaries.pwc.com:443/albania
    country_name : Algeria  data-url : https://taxsummaries.pwc.com:443/algeria
    country_name : Angola  data-url : https://taxsummaries.pwc.com:443/angola
    country_name : Argentina  data-url : https://taxsummaries.pwc.com:443/argentina
    country_name : Armenia  data-url : https://taxsummaries.pwc.com:443/armenia
    country_name : Australia  data-url : https://taxsummaries.pwc.com:443/australia
    country_name : Austria  data-url : https://taxsummaries.pwc.com:443/austria
    country_name : Azerbaijan  data-url : https://taxsummaries.pwc.com:443/azerbaijan
    country_name : Bahrain  data-url : https://taxsummaries.pwc.com:443/bahrain
    country_name : Barbados  data-url : https://taxsummaries.pwc.com:443/barbados
    country_name : Belarus  data-url : https://taxsummaries.pwc.com:443/belarus
    country_name : Belgium  data-url : https://taxsummaries.pwc.com:443/belgium
    country_name : Bermuda  data-url : https://taxsummaries.pwc.com:443/bermuda
    country_name : Bolivia  data-url : https://taxsummaries.pwc.com:443/bolivia
    country_name : Bosnia and Herzegovina  data-url : https://taxsummaries.pwc.com:443/bosnia-and-herzegovina
    country_name : Botswana  data-url : https://taxsummaries.pwc.com:443/botswana
    country_name : Brazil  data-url : https://taxsummaries.pwc.com:443/brazil
    country_name : Bulgaria  data-url : https://taxsummaries.pwc.com:443/bulgaria
    
    
    and so on ......
    

    To collect the results in a list instead:

    for each in options:
        l.append(each.text + " : " + each["data-url"])
    print(l)
    

    Output will be:

    ['Albania : https://taxsummaries.pwc.com:443/albania', 'Algeria : https://taxsummaries.pwc.com:443/algeria', 'Angola : https://taxsummaries.pwc.com:443/angola', 'Argentina : https://taxsummaries.pwc.com:443/argentina', 'Armenia : https://taxsummaries.pwc.com:443/armenia', 'Australia : https://taxsummaries.pwc.com:443/australia', 'Austria : https://taxsummaries.pwc.com:443/austria', 'Azerbaijan : https://taxsummaries.pwc.com:443/azerbaijan', 'Bahrain : https://taxsummaries.pwc.com:443/bahrain', 'Barbados : https://taxsummaries.pwc.com:443/barbados', 'Belarus : https://taxsummaries.pwc.com:443/belarus', 'Belgium : https://taxsummaries.pwc.com:443/belgium', 'Bermuda : https://taxsummaries.pwc.com:443/bermuda', 'Bolivia : https://taxsummaries.pwc.com:443/bolivia', 'Bosnia and Herzegovina : https://taxsummaries.pwc.com:443/bosnia-and-herzegovina', 'Botswana : https://taxsummaries.pwc.com:443/botswana', 'Brazil : https://taxsummaries.pwc.com:443/brazil', 'Bulgaria : https://taxsummaries.pwc.com:443/bulgaria', 'Cabo Verde : https://taxsummaries.pwc.com:443/cabo-verde', 'Cambodia : https://taxsummaries.pwc.com:443/cambodia', 'Cameroon, Republic of : https://taxsummaries.pwc.com:443/republic-of-cameroon',
    
    
    and so on............]
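
The difference between the two filters can also be checked offline, without requesting the site. The sketch below is only an illustration: it parses the four option tags quoted in the question (with the data-id attributes dropped for brevity), and the names SAMPLE_HTML and extract_countries are made up for this example. It assumes the aliases are separated by newlines, as they appear in the quoted markup.

```python
from bs4 import BeautifulSoup

# Markup copied from the question; data-id attributes are omitted for brevity.
SAMPLE_HTML = """
<select>
<option data-alias="" data-url="https://taxsummaries.pwc.com:443/uganda">Uganda</option>
<option data-alias="" data-url="https://taxsummaries.pwc.com:443/ukraine">Ukraine</option>
<option data-alias="UAE" data-url="https://taxsummaries.pwc.com:443/united-arab-emirates">United Arab Emirates</option>
<option data-alias="Great Britain
UK
Britain
Whales
Northern Ireland
England" data-url="https://taxsummaries.pwc.com:443/united-kingdom">United Kingdom</option>
</select>
"""


def extract_countries(html):
    """Return one dict per <option> that has a data-alias attribute (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # True means "the attribute is present", regardless of its value.
    for option in soup.find_all("option", {"data-alias": True}):
        # Assumption: aliases are newline-separated; an empty attribute gives [].
        aliases = [a.strip() for a in option["data-alias"].splitlines() if a.strip()]
        rows.append(
            {
                "country": option.get_text(strip=True),
                "url": option.get("data-url"),
                "aliases": aliases,
            }
        )
    return rows


# The filter from the question: it skips the options whose alias holds text.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
empty_only = soup.find_all("option", {"data-alias": ""})
print("empty alias only:", [o.get_text(strip=True) for o in empty_only])

# The filter from the answer: every option with the attribute is returned.
rows = extract_countries(SAMPLE_HTML)
for row in rows:
    print(row["country"], row["url"], row["aliases"])
```

Returning dictionaries rather than pre-joined strings keeps the country name, URL, and aliases separate, which may be easier to work with when the tax pages are requested one by one later.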