python html beautifulsoup html-parsing

How to find the list of all HTML tags that are active on a particular piece of text


I want to parse HTML to convert it to some other format while keeping some of the styles (bold, lists, etc.).

To better explain what I mean, consider the following HTML:

<html>
<body>

<h2>A Nested List</h2>
<p>List <b>can</b> be nested (lists inside lists):</p>

<ul>
  <li>Coffee</li>
  <li>Tea
    <ul>
      <li>Black tea</li>
      <li>Green tea</li>
    </ul>
  </li>
  <li>Milk</li>
</ul>

</body>
</html>

Now if I were to select the word "List" at the start of the paragraph, my output should be (html, body, p), since those are the tags active on the word "List".

As another example, if I were to select the text "Black tea", my output should be (html, body, ul, li, ul, li), since it's part of the nested list.

I've seen the Chrome inspector do this, but I'm not sure how to do it in Python.

Here is an image of what the Chrome inspector shows: (image: Chrome Inspector)

I've tried parsing the HTML with Beautiful Soup, and while it is great for extracting data, I was unable to solve my problem with it.

Later I tried html-parser for the same issue, building a stack of all the tags seen before a piece of "data" and popping them as I encountered the corresponding end tags, but I couldn't get that to work either.
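
The stack idea described above can be sketched with the standard library's html.parser module. This is only a minimal sketch under some assumptions: the names ActiveTagParser, active_tags, VOID_TAGS and SAMPLE_HTML are made up for illustration, the HTML is assumed to be reasonably well formed, and a target is matched as a case-sensitive substring of a single text node.

```python
from html.parser import HTMLParser

# Hypothetical names throughout; only HTMLParser and its handle_* hooks
# come from the standard library.
# Void elements never get an end tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

SAMPLE_HTML = """<html>
<body>
<h2>A Nested List</h2>
<p>List <b>can</b> be nested (lists inside lists):</p>
<ul>
  <li>Coffee</li>
  <li>Tea
    <ul>
      <li>Black tea</li>
      <li>Green tea</li>
    </ul>
  </li>
  <li>Milk</li>
</ul>
</body>
</html>"""


class ActiveTagParser(HTMLParser):
    """Record the open-tag stack for every text node containing a target."""

    def __init__(self, targets):
        super().__init__()
        self.targets = targets
        self.stack = []    # tags currently open, outermost first
        self.matches = []  # (target, tuple_of_open_tags) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, which also drops
        # any tags left unclosed in between. Stray end tags are ignored.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        for target in self.targets:
            if target in data:
                self.matches.append((target, tuple(self.stack)))


def active_tags(html, targets):
    """Return (target, open_tags) pairs in document order."""
    parser = ActiveTagParser(targets)
    parser.feed(html)
    parser.close()
    return parser.matches


if __name__ == "__main__":
    for target, tags in active_tags(SAMPLE_HTML, ["List", "Black tea"]):
        print(target, ":", tags)
```

Because the stack is copied at the moment the text is seen, "Black tea" is reported with both ul/li pairs, while "List" is reported once for the h2 heading and once for the paragraph.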


Solution

  • As you said in your comment, this may or may not get you what you want, but it may be a start, so I would try it anyway and see what happens:

    from lxml import etree

    snippet = """[your html above]"""
    root = etree.fromstring(snippet)
    tree = etree.ElementTree(root)  # getpath() lives on the tree, not the element

    targets = ['List', 'nested', 'Black tea']
    for e in root.iter():
        for target in targets:
            # e.text is the text right after e's start tag,
            # e.tail is the text right after e's end tag
            if (e.text and target in e.text) or (e.tail and target in e.tail):
                print(target, ' :', tree.getpath(e))
    

    Output is

    List  : /html/body/h2
    List  : /html/body/p
    nested  : /html/body/p/b
    Black tea  : /html/body/ul/li[2]/ul/li[1]
    

    As you can see, this gives you the XPath to each selected text target. A few things to note. First, "List" appears twice because it appears twice in the text. Second, the "Black tea" XPath contains positional values (for example, the [2] in /li[2]), which indicate that the target string appears in the second li element at that level, and so on. If you don't need that, you can strip that information from the output (or use another tool). Third, "nested" is reported at /html/body/p/b because it was matched in the tail of the b element, that is, the text following </b>, which actually sits inside the p element.
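
If the positional values are not wanted, one option is to skip getpath() and build the tag chain from each element's ancestors instead. The following is a rough sketch rather than a definitive solution: the helper names tag_chain and find_active_tags and the SNIPPET constant are made up for illustration, and it assumes the markup parses as XML, as in the code above. It also attributes text found in an element's tail to the parent element, since that is where the text actually sits.

```python
from lxml import etree

# SNIPPET stands in for the HTML from the question; it must be
# well-formed XML for etree.fromstring() to accept it.
SNIPPET = """<html>
<body>
<h2>A Nested List</h2>
<p>List <b>can</b> be nested (lists inside lists):</p>
<ul>
  <li>Coffee</li>
  <li>Tea
    <ul>
      <li>Black tea</li>
      <li>Green tea</li>
    </ul>
  </li>
  <li>Milk</li>
</ul>
</body>
</html>"""


def tag_chain(element):
    """Return the tag names from the root down to element, no positions."""
    chain = [a.tag for a in element.iterancestors()]  # nearest ancestor first
    chain.reverse()
    chain.append(element.tag)
    return tuple(chain)


def find_active_tags(root, targets):
    """Return (target, tag_chain) pairs in document order."""
    results = []
    # Restrict iteration to real elements (skips comments and PIs).
    for e in root.iter(etree.Element):
        for target in targets:
            if e.text and target in e.text:
                results.append((target, tag_chain(e)))
            # Tail text follows e's end tag, so it belongs to e's parent.
            if e.tail and target in e.tail:
                parent = e.getparent()
                if parent is not None:
                    results.append((target, tag_chain(parent)))
    return results


if __name__ == "__main__":
    root = etree.fromstring(SNIPPET)
    for target, chain in find_active_tags(root, ["List", "nested", "Black tea"]):
        print(target, ":", chain)
```

With this variant, "Black tea" comes out as (html, body, ul, li, ul, li), and "nested" is attributed to (html, body, p) rather than to the b element.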