Tags: python, xpath, lxml, lxml.html, calibre

Select and modify xpath nodes after specific text



I use this code to get all the names from the author links:

def parse_authors(self, root): 
    author_nodes = root.xpath('//a[@class="booklink"][contains(@href,"/author/")]/text()')
    if author_nodes:
        return [unicode(author) for author in author_nodes]

But if there are any translators, I want to add "(translation)" next to their names.

Example: translator1(translation)

Solution

  • You can use the "translation:" text node to distinguish authors from translators: authors are the preceding siblings of that text node, and translators are its following siblings.

    Authors:

    //text()[contains(., 'translation:')]/preceding-sibling::a[@class='booklink' and contains(@href, '/author/')]/text()
    

    Translators:

    //text()[contains(., 'translation:')]/following-sibling::a[@class='booklink' and contains(@href, '/author/')]/text()
    

    Working sample code:

    from lxml.html import fromstring
    
    data = """
    <td>
        <a class="booklink" href="/author/43710/Author 1">Author 1</a>
        ,
         <a class="booklink" href="/author/46907/Author 2">Author 2</a>
         <br>
         translation:
         <a class="booklink" href="/author/47669/translator 1">Translator 1</a>
         ,
         <a class="booklink" href="/author/9382/translator 2">Translator 2</a>
    </td>"""
    
    root = fromstring(data)
    
    authors = root.xpath("//text()[contains(., 'translation:')]/preceding-sibling::a[@class='booklink' and contains(@href, '/author/')]/text()")
    translators = root.xpath("//text()[contains(., 'translation:')]/following-sibling::a[@class='booklink' and contains(@href, '/author/')]/text()")
    
    print(authors)
    print(translators)
    

    Prints:

    ['Author 1', 'Author 2']
    ['Translator 1', 'Translator 2']
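
To get the "(translation)" suffix the question asks for, the two result lists can be merged. The following is a minimal sketch under stated assumptions, not tested against the real page: the helper name names_with_translators, its suffix parameter, and the simplified div wrapper in the sample markup are all hypothetical. It also falls back to a plain list of names when no "translation:" text node is present, since both XPath expressions above return nothing in that case.

```python
from lxml.html import fromstring

# Predicate shared by every query: author/translator links on the page.
LINK = "a[@class='booklink' and contains(@href, '/author/')]"
# The text node that separates authors from translators.
MARKER = "//text()[contains(., 'translation:')]"


def names_with_translators(root, suffix="(translation)"):
    """Return author names followed by translator names tagged with suffix.

    Hypothetical helper, not part of lxml or calibre.
    """
    if not root.xpath(MARKER):
        # No "translation:" marker: treat every matching link as an author.
        return [name.strip() for name in root.xpath("//" + LINK + "/text()")]

    authors = root.xpath(MARKER + "/preceding-sibling::" + LINK + "/text()")
    translators = root.xpath(MARKER + "/following-sibling::" + LINK + "/text()")
    return [a.strip() for a in authors] + [t.strip() + suffix for t in translators]


# Simplified sample markup (assumed; a <div> stands in for the table cell).
sample = """
<div>
    <a class="booklink" href="/author/43710/Author 1">Author 1</a>,
    <a class="booklink" href="/author/46907/Author 2">Author 2</a>
    <br>
    translation:
    <a class="booklink" href="/author/47669/translator 1">Translator 1</a>
</div>"""

print(names_with_translators(fromstring(sample)))
```

In the original parse_authors method, the same logic could replace the single XPath query, assuming the page always uses the literal text "translation:" as the separator.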