python regex beautifulsoup urlopen

How to extract URLs from an HTML page


I'm new to web scraping. This is what I tried:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
html = urlopen("http://chgk.tvigra.ru/letopis/?2016/2016_spr#27mar")
soup = BeautifulSoup(html, "html.parser")
res = soup.find_all('a', {'href': re.compile("r'\b?20\b'")})
print (res)

and I get an empty list:

[]

My target is this fragment of the page:

<script language="javascript" type="text/javascript">
cont = new Array();
count = new Array();
for (i=1979; i <=2015; i++){count[i]=0};
cont[1979] =    "<li><a href='?1979_1#24jan'>24 января</a>" +  

..............

cont[2016] =    "<li><a href='?2016/2016_spr#cur'>Весенняя серия</a>" +
        "<li><a href='?2016/2016_sum#cur'>Летняя серия</a>" +
        "<li><a href='?2016/2016_aut#cur'>Осенняя серия</a>" +
        "<li><a href='?2016/2016_win#cur'>Зимняя серия</a>";

And I am trying to get a result like this:

'?2016/2016_spr#cur' 
'?2016/2016_sum#cur'
'?2016/2016_aut#cur'
'?2016/2016_win#cur'

I want the links from 2000 to the present (that is why the pattern "r'\b?20\b'" contains '20'). Can you help me, please?


Solution

  • Preliminaries:

    >>> import requests
    >>> import bs4
    >>> page = requests.get('http://chgk.tvigra.ru/letopis/?2016/2016_spr#27mar').content
    >>> soup = bs4.BeautifulSoup(page, 'lxml')
    

    Having done this, it might seem that the most straightforward way to identify the script element is this:

    >>> scripts = soup.findAll('script', text=bs4.re.compile('cont = new Array();'))
    

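    However, scripts turns out to be an empty list, most likely because the unescaped parentheses in the pattern are treated as an empty regex group rather than as literal characters.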
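
    A minimal sketch of how regex metacharacters affect such a search, using only the standard re module. The script_text value is a made-up two-line stand-in, not the real page content:

```python
import re

# Hypothetical stand-in for the start of the script's text.
script_text = "cont = new Array();\ncount = new Array();"

# Unescaped: "()" is an empty group, so this looks for "cont = new Array;".
unescaped = re.compile("cont = new Array();")

# Escaped: re.escape turns every metacharacter into a literal.
escaped = re.compile(re.escape("cont = new Array();"))

unescaped_hit = unescaped.search(script_text)
escaped_hit = escaped.search(script_text)

print(unescaped_hit)            # no match
print(escaped_hit is not None)  # literal text is found
```

    If this is indeed the cause, passing the escaped pattern to the search above should find the script element, though that has not been checked against the live page.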

    The basic approach does work if I choose a different target within the script, but it seems unsafe to depend on the exact formatting of the contents of a JavaScript script element.

    >>> scripts = soup.find_all(string=bs4.re.compile('i=1979'))
    >>> len(scripts)
    1
    

    Still, this might be good enough for you. Just note that the script ends with the change function, which needs to be discarded.
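
    One way to drop the trailing function is to cut the script text at its declaration. This is only a sketch: it assumes the declaration appears literally as "function change", and the script_text below is a hypothetical, shortened stand-in for the real script.

```python
# Hypothetical, shortened stand-in for the script's text.
script_text = (
    "cont = new Array();\n"
    "cont[2016] = \"<li><a href='?2016/2016_spr#cur'>spring</a>\";\n"
    "function change(year) { /* hypothetical body */ }\n"
)

# Keep everything before the first "function change"; if the marker is
# absent, partition returns the whole text unchanged.
data_part, _, _ = script_text.partition("function change")

print(data_part)
```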

    A safer approach might be to look for the containing table element, then the second td element within that and finally the script within that.

    >>> table = soup.find_all('table', class_='common_table')
    >>> td = table[0].find_all('td')[1]
    >>> script = td.find('script')
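
    The same lookup can be tried offline on a small inline document. The HTML below is a made-up miniature of the layout described here (a common_table whose second td holds the script), not a copy of the real page, and the None checks are an addition for robustness:

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the page layout.
html = """
<table class="common_table"><tr>
  <td>menu</td>
  <td><script type="text/javascript">cont = new Array();</script></td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

script = None
table = soup.find("table", class_="common_table")
if table is not None:
    cells = table.find_all("td")
    if len(cells) > 1:
        # The script is expected inside the second cell.
        script = cells[1].find("script")

# .string is the script's text when it has a single text child.
script_text = script.string if script is not None else ""
print(script_text)
```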
    

    Again, you will need to discard the change function.
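
    With the script text in hand, the links asked for in the question can be pulled out with a regular expression. A minimal sketch, assuming the hrefs are always single-quoted as in the quoted fragment. The script_text here is an abridged copy of that fragment with shortened link labels, and HREF_FROM_2000 is a name chosen for illustration:

```python
import re

# Abridged copy of the fragment quoted in the question (labels shortened).
script_text = """
cont[1979] =    "<li><a href='?1979_1#24jan'>24 Jan</a>";
cont[2016] =    "<li><a href='?2016/2016_spr#cur'>spring</a>" +
        "<li><a href='?2016/2016_sum#cur'>summer</a>" +
        "<li><a href='?2016/2016_aut#cur'>autumn</a>" +
        "<li><a href='?2016/2016_win#cur'>winter</a>";
"""

# The r prefix goes outside the quotes; \? matches a literal question mark,
# and 20\d{2} restricts the match to years 2000 through 2099.
HREF_FROM_2000 = re.compile(r"href='(\?20\d{2}[^']*)'")

links = HREF_FROM_2000.findall(script_text)
for link in links:
    print(link)
```

    Note that in the question's pattern the r prefix sits inside the string, so it is matched as literal text and \b becomes a backspace character. That pattern therefore cannot match these hrefs.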