
Scrapy selector for nodes between <br /> tags


I have HTML code like this:

<div>
  Foo <span>Bar</span><br />
  Baz<br />
  <b>Foobar</b> Quux
</div>

Now I'd like to process the nodes separated by <br /> tags like this:

nodes  = sel.xpath("???")
my_foo = nodes[0] # contains Foo <span>Bar</span>
my_bar = nodes[1] # contains Baz
my_fb  = nodes[2] # contains <b>Foobar</b> Quux

Is there some XPath or CSS expression that will do this or do I have to iterate over all child nodes of <div>, building an array in the process for each node that is not a <br>?


Solution

  • The closest I can think of is this:

    [sel.xpath('''.//div/node()[count(preceding-sibling::br)=%d]
                               [not(self::br)]''' % i).extract()
     for i in range(0, len(sel.xpath('.//div/br'))+1)]
    

    which gives you:

    [[u'\n  Foo ', u'<span>Bar</span>'],
     [u'\n  Baz'],
     [u'\n  ', u'<b>Foobar</b>', u' Quux\n']]
    

    That is, one list per group of nodes between the <br/> elements under <div>. The expression counts each node's preceding <br> siblings (none, then 1, then 2) and excludes the <br> elements themselves.
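
For comparison, the fallback mentioned in the question (iterating over all child nodes of <div> and starting a new group at each <br>) could be sketched roughly as below. This is only a sketch under assumptions: the children are taken as serialized strings, as sel.xpath('.//div/node()').extract() would be expected to return them for the sample HTML, and split_on_br and BR_RE are hypothetical names, not part of Scrapy.

```python
import re

# Matches a serialized <br> element: "<br>", "<br/>" or "<br />".
BR_RE = re.compile(r'<br\s*/?>$', re.IGNORECASE)


def split_on_br(nodes):
    """Group serialized nodes into runs separated by <br> elements."""
    groups = [[]]
    for node in nodes:
        if BR_RE.match(node.strip()):
            # A <br> closes the current group and opens a new one.
            groups.append([])
        else:
            groups[-1].append(node)
    return groups


# Assumed input: the serialized children of the sample <div>, i.e. what
# sel.xpath('.//div/node()').extract() is expected to yield.
children = ['\n  Foo ', '<span>Bar</span>', '<br>',
            '\n  Baz', '<br>',
            '\n  ', '<b>Foobar</b>', ' Quux\n']

groups = split_on_br(children)
print(groups)
```

This needs a single XPath query instead of one per group. The trade-off is that it classifies nodes by their serialized form, so a text node whose content is literally "<br>" would be mistaken for a separator.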