Unable to crawl some href in a webpage using python and beautifulsoup


I am crawling a web page with Python 3.4 and bs4 to collect the results of the matches Serbia played at Rio 2016. The URL here contains links to all of the team's match results, for example this.

Then I found that the link is located in the html source like this:

<a href="/en/volleyball/women/7168-serbia-italy/post" ng-href="/en/volleyball/women/7168-serbia-italy/post">
    <span class="score ng-binding">3 - 0</span>
</a>

But after several attempts, this href="/en/volleyball/women/7168-serbia-italy/post" never shows up in my results. I then ran the following code to list every href on the page:

from bs4 import BeautifulSoup
import requests

Countryr = requests.get('http://rio2016.fivb.com/en/volleyball/women/teams/srb-serbia#wcbody_0_wcgridpadgridpad1_1_wcmenucontent_3_Schedule')
countrySoup = BeautifulSoup(Countryr.text, 'html.parser')

for link in countrySoup.find_all('a'):
    print(link.get('href'))

Strangely, href="/en/volleyball/women/7168-serbia-italy/post" is not in the output at all.

I found that this href sits inside one of the tab pages on that URL (href="#scheduldedOver"), which is controlled by the following HTML:

<nav class="tabnav">
    <a href="#schedulded" ng-class="{selected: chosenStatus == 'Pre' }" ng-click="setStatus('Pre')" ng-href="#schedulded">Scheduled</a>
    <a href="#scheduldedLive" ng-class="{selected: chosenStatus == 'Live' }" ng-click="setStatus('Live')" ng-href="#scheduldedLive">Live</a>
    <a href="#scheduldedOver" class="selected" ng-class="{selected: chosenStatus == 'Over' }" ng-click="setStatus('Over')" ng-href="#scheduldedOver">Complete</a>
</nav>

How can I use BeautifulSoup to get an href that sits inside a tab page?


Solution

  • The data is created dynamically; if you look at the actual source you can see AngularJS templating, so the match links are rendered by JavaScript in the browser and are not in the HTML that requests downloads.

    You can still get all the info in JSON format by mimicking the Ajax call. In the source you can also see a div like:

    <div id="AngularPanel" class="main-wrapper" ng-app="fivb"
    data-servicematchcenterbar="/en/api/volley/matches/341/en/user/lives"
    data-serviceteammatches="/en/api/volley/matches/WOG2016/en/user/team/3017"
    data-servicelabels="/en/api/labels/Volley/en" 
    data-servicelive="/en/api/volley/matches/WOG2016/en/user/live/">
    

    Requesting the path stored in the data-serviceteammatches attribute will give you all the info:

    from bs4 import BeautifulSoup
    import requests
    from urllib.parse import urljoin  # Python 3; Python 2 used the urlparse module
    
    r = requests.get('http://rio2016.fivb.com/en/volleyball/women/teams/srb-serbia#wcbody_0_wcgridpadgridpad1_1_wcmenucontent_3_Schedule')
    soup = BeautifulSoup(r.content, 'html.parser')
    
    base = "http://rio2016.fivb.com/"
    
    # The div's data attribute holds the API path for this team's matches
    api_path = soup.select_one("#AngularPanel")["data-serviceteammatches"]
    matches = requests.get(urljoin(base, api_path)).json()
    

    The returned JSON contains records like:

    {"Id": 7168, "MatchNumber": "006", "TournamentCode": "WOG2016", "TournamentName": "Women's Olympic Games 2016",
            "TournamentGroupName": "", "Gender": "", "LocalDateTime": "2016-08-06T22:35:00",
            "UtcDateTime": "2016-08-07T01:35:00+00:00", "CalculatedMatchDate": "2016-08-07T03:35:00+02:00",
            "CalculatedMatchDateType": "user", "LocalDateTimeText": "August 06 2016",
            "Pool": {"Code": "B", "Name": "Pool B", "Url": "/en/volleyball/women/results and ranking/round1#anchorB"},
            "Round": 68,
            "Location": {"Arena": "Maracanãzinho", "City": "Maracanãzinho", "CityUrl": "", "Country": "Brazil"},
            "TeamA": {"Code": "SRB", "Name": "Serbia", "Url": "/en/volleyball/women/teams/srb-serbia",
                      "FlagUrl": "/~/media/flags/flag_SRB.png?h=60&w=60"},
            "TeamB": {"Code": "ITA", "Name": "Italy", "Url": "/en/volleyball/women/teams/ita-italy",
                      "FlagUrl": "/~/media/flags/flag_ITA.png?h=60&w=60"},
            "Url": "/en/volleyball/women/7168-serbia-italy/post", "TicketUrl": "", "Status": "Over", "MatchPointsA": 3,
            "MatchPointsB": 0, "Sets": [{"Number": 1, "PointsA": 27, "PointsB": 25, "Hours": 0, "Minutes": "28"},
                                        {"Number": 2, "PointsA": 25, "PointsB": 20, "Hours": 0, "Minutes": "25"},
                                        {"Number": 3, "PointsA": 25, "PointsB": 23, "Hours": 0, "Minutes": "27"}],
            "PoolRoundName": "Preliminary Round", "DayInfo": "Weekend Day",
            "WeekInfo": {"Number": 31, "Start": 7, "End": 13}, "LiveStreamUri": ""},
    

    You can parse whatever you need from those.
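
As a minimal sketch of that parsing step, the snippet below pulls the match link, final score, and set scores out of records shaped like the sample above. It runs offline on a trimmed copy of that sample. The assumption that the endpoint returns a list of such records, the `summarize_matches` helper, and the `SAMPLE` constant are all illustrative and not part of the original answer. Filtering on `"Status": "Over"` mirrors the "Complete" tab from the question.

```python
import json
from urllib.parse import urljoin

BASE = "http://rio2016.fivb.com/"

# Trimmed copy of the sample record shown above.
# Assumption: the real endpoint returns a list of records like this one.
SAMPLE = '''[{"Id": 7168, "Status": "Over",
  "TeamA": {"Code": "SRB", "Name": "Serbia"},
  "TeamB": {"Code": "ITA", "Name": "Italy"},
  "Url": "/en/volleyball/women/7168-serbia-italy/post",
  "MatchPointsA": 3, "MatchPointsB": 0,
  "Sets": [{"Number": 1, "PointsA": 27, "PointsB": 25},
           {"Number": 2, "PointsA": 25, "PointsB": 20},
           {"Number": 3, "PointsA": 25, "PointsB": 23}]}]'''


def summarize_matches(matches, base=BASE):
    """Return one dict per completed match with an absolute link and scores."""
    results = []
    for match in matches:
        if match.get("Status") != "Over":
            continue  # skip scheduled and live matches
        results.append({
            "teams": "{} - {}".format(match["TeamA"]["Name"],
                                      match["TeamB"]["Name"]),
            "score": "{} - {}".format(match["MatchPointsA"],
                                      match["MatchPointsB"]),
            "sets": [(s["PointsA"], s["PointsB"])
                     for s in match.get("Sets", [])],
            # The "Url" field is site-relative, so join it with the base URL
            "url": urljoin(base, match["Url"]),
        })
    return results


if __name__ == "__main__":
    for row in summarize_matches(json.loads(SAMPLE)):
        print(row["teams"], row["score"], row["url"])
```

With live data, the parsed JSON from the API request could be passed to the helper in place of `json.loads(SAMPLE)`, provided the response really has this shape.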