Regex in Python - find all stylesheets in html


This is part of my HTML code:

<link rel ="stylesheet" type="text/css" href="catalog/view/theme/default/stylesheet/stylesheet.css" />
<link id='all-css-0' href='http://1' type='text/css' media='all' rel='stylesheet'  />
<link rel='stylesheet'  id='all-css-1' href =   'http://2' type='text/css' media='all' />

I need to find the href of every stylesheet link.

I tried a regular expression like this:

 <link\s+rel\s*=\s*["']stylesheet["']\s*href\s*=\s*["'](.*?)["'][^>]*?>

The full code is

import re

body = '''<link rel ="stylesheet" type="text/css" href="catalog/view/theme/default/stylesheet/stylesheet.css" />
<link id='all-css-0' href='http://1' type='text/css' media='all' rel='stylesheet'  />
<link rel='stylesheet'  id='all-css-1' href =   'http://2' type='text/css' media='all' />'''

# Raw string so that \s reaches the regex engine unchanged.
real_viraz = r'''<link\s+rel\s*=\s*["']stylesheet["']\s*href\s*=\s*["'](.*?)["'][^>]*?>'''
r = re.findall(real_viraz, body, re.I|re.DOTALL)
print(r)

But the problem is that rel='stylesheet' and href='...' can appear in any order inside <link ...>, and almost anything can appear between them.
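
To make the symptom concrete, here is a small self-contained check (a sketch; the names `pattern`, `sample`, and `adjacent` are illustrative and not part of the original code). The pattern only matches when `href` is the attribute immediately after `rel`, which is not the case for any of the three sample tags:

```python
import re

# Pattern and sample copied from the question; variable names are illustrative.
pattern = r'''<link\s+rel\s*=\s*["']stylesheet["']\s*href\s*=\s*["'](.*?)["'][^>]*?>'''

sample = """<link rel ="stylesheet" type="text/css" href="catalog/view/theme/default/stylesheet/stylesheet.css" />
<link id='all-css-0' href='http://1' type='text/css' media='all' rel='stylesheet'  />
<link rel='stylesheet'  id='all-css-1' href =   'http://2' type='text/css' media='all' />"""

# None of the sample tags has href directly after rel, so nothing matches.
matches = re.findall(pattern, sample, re.I | re.DOTALL)
print(matches)  # → []

# A hypothetical tag with href right after rel does match.
adjacent = '<link rel="stylesheet" href="a.css">'
print(re.findall(pattern, adjacent, re.I | re.DOTALL))  # → ['a.css']
```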

Please help me to find the right regular expression. Thanks.


Solution

  • Short answer: Don't use regular expressions to parse (X)HTML; use an (X)HTML parser.

    In Python, one option is lxml (a third-party library). Parse the HTML with lxml's HTMLParser, use an XPath query to select the link elements, and collect their href attributes:

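    from lxml import etree

    # HTMLParser tolerates real-world HTML that is not well-formed XML.
    parser = etree.HTMLParser()

    # etree.parse() accepts a file path, so no file handle is left open.
    doc = etree.parse('sample.html', parser)
    # Exact match on rel; attribute order inside the tag does not matter.
    links = doc.xpath("//head/link[@rel='stylesheet']")
    hrefs = [link.attrib['href'] for link in links]

    print(hrefs)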
    

    Output:

    ['catalog/view/theme/default/stylesheet/stylesheet.css', 'http://1', 'http://2']
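
If installing lxml is not an option, the same idea can be sketched with the standard library's `html.parser` module (Python 3). This is a minimal illustration, not a drop-in replacement: the class name `StylesheetCollector` and the helper `find_stylesheet_hrefs` are made up for this example, and unlike the XPath query above it does not require the link to sit inside `<head>`.

```python
from html.parser import HTMLParser


class StylesheetCollector(HTMLParser):
    """Collect href values from <link> tags whose rel includes 'stylesheet'."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, so attribute order is irrelevant.
        if tag != "link":
            return
        attributes = dict(attrs)
        # rel may hold several space-separated tokens, e.g. "alternate stylesheet".
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "stylesheet" in rel_tokens and attributes.get("href"):
            self.hrefs.append(attributes["href"])


def find_stylesheet_hrefs(html_text):
    """Return the stylesheet hrefs found in html_text, in document order."""
    collector = StylesheetCollector()
    collector.feed(html_text)
    collector.close()
    return collector.hrefs


# Sample markup copied from the question.
sample = """<link rel ="stylesheet" type="text/css" href="catalog/view/theme/default/stylesheet/stylesheet.css" />
<link id='all-css-0' href='http://1' type='text/css' media='all' rel='stylesheet'  />
<link rel='stylesheet'  id='all-css-1' href =   'http://2' type='text/css' media='all' />"""

print(find_stylesheet_hrefs(sample))
```

With the sample from the question, this should print the same three hrefs as the lxml version.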