I am working on a project in Python 3.4 on Windows 10 (64-bit). I want to scrape livescore.com for football scores, e.g. all of the day's results (England 2-2 Norway, France 2-1 Italy, etc.).
I have tried two approaches. Here is the first:
import bs4 as bs
import urllib.request
sauce = urllib.request.urlopen('http://www.livescore.com/').read()
soup = bs.BeautifulSoup(sauce,'lxml')
for div in soup.find_all('div', class_='container'):
    print(div.text)
When I run this code, a box pops up saying:
IDLE's subprocess didn't make connection. Either IDLE can't start a subprocess or firewall software is blocking the connection.
So I decided to write another one. This is the code:
# Import Modules
import urllib.request
import re
# Downloading Live Score XML Code From Website and reading also
xml_data = urllib.request.urlopen('http://static.cricinfo.com/rss/livescores.xml').read()
# Pattern For Searching Score and link
pattern = "<item>(.*?)</item>"
# Finding Matches
for i in re.findall(pattern, xml_data, re.DOTALL):
    result = re.split('<.+?>', i)
    print(result[1], result[3])  # Print Score
And I got this error:
Traceback (most recent call last):
  File "C:\Users\Bright\Desktop\live_score.py", line 12, in <module>
    for i in re.findall(pattern, xml_data, re.DOTALL):
  File "C:\Python34\lib\re.py", line 206, in findall
    return _compile(pattern, flags).findall(string)
TypeError: can't use a string pattern on a bytes-like object
Regarding your first example: the site loads its content with heavy JavaScript, so I suggest using Selenium to fetch the rendered page source.
Your code should look like this:
import bs4 as bs
from selenium import webdriver
url = 'http://www.livescore.com/'
browser = webdriver.Chrome()
browser.get(url)
sauce = browser.page_source
browser.quit()
soup = bs.BeautifulSoup(sauce,'lxml')
for div in soup.find('div', attrs={'data-type': 'container'}).find_all('div'):
    print(div.text)
For the second example, the regular expression engine raises an error because the read() method of the urllib.request response returns bytes, and "re" cannot apply a str pattern to a bytes object; the pattern and the input must be of the same type. So you just have to convert xml_data to a str first.
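The type mismatch can be reproduced without touching the network. This is only a minimal sketch: the `sample` bytes value is a made-up stand-in for what `read()` returns, and the variable names are hypothetical.

```python
import re

# Hypothetical stand-in for the bytes returned by urlopen(...).read()
sample = b"<item><title>Team A 1-0 Team B</title></item>"

str_pattern = "<item>(.*?)</item>"     # str pattern, as in the question
bytes_pattern = b"<item>(.*?)</item>"  # the same pattern as bytes

# A str pattern applied to bytes raises the TypeError from the traceback
try:
    re.findall(str_pattern, sample, re.DOTALL)
    mismatch_error = None
except TypeError as exc:
    mismatch_error = str(exc)

# Matching types work: bytes with bytes, or str with decoded text
bytes_matches = re.findall(bytes_pattern, sample, re.DOTALL)
text_matches = re.findall(str_pattern, sample.decode("utf-8"), re.DOTALL)

print(mismatch_error)
print(bytes_matches)
print(text_matches)
```

Either keeping everything as bytes or decoding to str avoids the error; decoding is usually more convenient when the results are going to be printed.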
This is the modified code:
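# Decode the bytes to str (UTF-8 is assumed here); str(xml_data) would give the b'...' repr instead
for i in re.findall(pattern, xml_data.decode('utf-8'), re.DOTALL):
    result = re.split('<.+?>', i)
    print(result[1], result[3])  # Print Score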
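On the choice of conversion: in Python 3, str() applied to a bytes object returns its printable representation (a leading b' plus escaped non-ASCII bytes and newlines), while decode() returns the actual text. The sketch below compares the two on a made-up feed item. The `sample` content, the `extract` helper, and the UTF-8 encoding are all assumptions for illustration; the real feed may differ.

```python
import re

PATTERN = "<item>(.*?)</item>"


def extract(text):
    """Return (title, link) pairs using the same regex approach as above."""
    pairs = []
    for item in re.findall(PATTERN, text, re.DOTALL):
        fields = re.split("<.+?>", item)
        pairs.append((fields[1], fields[3]))
    return pairs


# Hypothetical feed content standing in for urlopen(...).read()
sample = ("<item><title>Atlético 2-1 Málaga</title>"
          "<link>http://example.com/1</link></item>").encode("utf-8")

from_repr = extract(str(sample))             # works on the repr of the bytes
from_text = extract(sample.decode("utf-8"))  # works on decoded text (UTF-8 assumed)

print(from_repr)
print(from_text)
```

Both calls run without a TypeError, but only the decoded version keeps the accented characters intact; the str() version contains literal escape sequences such as \xc3\xa9.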