This is part of my HTML code:
<link rel ="stylesheet" type="text/css" href="catalog/view/theme/default/stylesheet/stylesheet.css" />
<link id='all-css-0' href='http://1' type='text/css' media='all' rel='stylesheet' />
<link rel='stylesheet' id='all-css-1' href = 'http://2' type='text/css' media='all' />
I need to find the hrefs of all stylesheets.
I tried a regular expression like this:
<link\s+rel\s*=\s*["']stylesheet["']\s*href\s*=\s*["'](.*?)["'][^>]*?>
The full code is:
import re

body = '''<link rel ="stylesheet" type="text/css" href="catalog/view/theme/default/stylesheet/stylesheet.css" />
<link id='all-css-0' href='http://1' type='text/css' media='all' rel='stylesheet' />
<link rel='stylesheet' id='all-css-1' href = 'http://2' type='text/css' media='all' />'''
# Raw string so the backslashes reach the regex engine unchanged.
# This pattern only matches when href comes directly after rel.
real_viraz = r'''<link\s+rel\s*=\s*["']stylesheet["']\s*href\s*=\s*["'](.*?)["'][^>]*?>'''
r = re.findall(real_viraz, body, re.I | re.DOTALL)
print(r)
The problem is that rel='stylesheet' and href='' can appear in any order inside <link ...>, and almost anything can appear between them.
Please help me find the right regular expression. Thanks.
Short answer: Don't use regular expressions to parse (X)HTML; use an (X)HTML parser.
In Python, this would be lxml. You could parse the HTML using lxml's HTML parser, then use an XPath query to get all the link elements and collect their href attributes:
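from lxml import etree

# The lenient HTML parser tolerates markup that is not well-formed XML
parser = etree.HTMLParser()
doc = etree.parse('sample.html', parser)
# Select every <link> in <head> whose rel attribute is exactly 'stylesheet'
links = doc.xpath("//head/link[@rel='stylesheet']")
hrefs = [link.attrib['href'] for link in links]
print(hrefs)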
Output:
['catalog/view/theme/default/stylesheet/stylesheet.css', 'http://1', 'http://2']
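If installing lxml is not an option, roughly the same idea can be sketched with the standard library's html.parser module. This is only a minimal sketch, assuming Python 3. The names StylesheetCollector and find_stylesheet_hrefs are made up for this example and are not part of any library. Unlike the XPath query above, it looks at every link tag rather than only those inside head, and it treats rel as a space-separated, case-insensitive list, so a value such as "alternate stylesheet" is accepted too.

```python
from html.parser import HTMLParser


class StylesheetCollector(HTMLParser):
    """Hypothetical helper: collects href values of <link> tags whose rel includes 'stylesheet'."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling this.
        if tag != "link":
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated tokens; compare case-insensitively.
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        href = attr_map.get("href")
        if "stylesheet" in rel_tokens and href:
            self.hrefs.append(href)


def find_stylesheet_hrefs(html_text):
    """Return stylesheet hrefs in document order (illustrative function name)."""
    collector = StylesheetCollector()
    collector.feed(html_text)
    collector.close()
    return collector.hrefs


body = '''<link rel ="stylesheet" type="text/css" href="catalog/view/theme/default/stylesheet/stylesheet.css" />
<link id='all-css-0' href='http://1' type='text/css' media='all' rel='stylesheet' />
<link rel='stylesheet' id='all-css-1' href = 'http://2' type='text/css' media='all' />'''

print(find_stylesheet_hrefs(body))
```

Because the parser hands over the attributes as name/value pairs, the order of rel and href and whatever sits between them no longer matters.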