python documentation beautifulsoup lxml html5lib

BeautifulSoup: Search from leaf to root to get the "deepest" elements first?


For a research project similar to this one, I want to extract all "documentation units" from the Python documentation. In the HTML of the Python documentation, a documentation unit can be one of the following (HTML element and class in parentheses):

  • a method (dl class: method)
  • a class (dl class: class)
  • a section (div class: section)

These units should be nested: a section contains several classes, which in turn contain several methods. In practice, however, the nesting is very irregular.
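To make the three kinds of units concrete, here is a minimal sketch of how they could be collected in document order with `find_all` and a predicate function. The sample HTML, the `id` attributes, and the helper name `is_unit` are made up for illustration; the real documentation pages are larger and less regular.

```python
from bs4 import BeautifulSoup

# Tiny made-up sample: one section containing a class with a method,
# plus one method that is not inside any section.
HTML = """
<div class="section" id="s1">
  <p>Intro text</p>
  <dl class="class" id="c1">
    <dt>class Foo</dt>
    <dd>
      <dl class="method" id="m1"><dt>Foo.bar()</dt><dd>Does bar.</dd></dl>
    </dd>
  </dl>
</div>
<dl class="method" id="m2"><dt>baz()</dt><dd>Top-level.</dd></dl>
"""


def is_unit(tag):
    """Hypothetical helper: True for <dl class="method|class"> and <div class="section">."""
    classes = tag.get("class", [])
    if tag.name == "dl":
        return "method" in classes or "class" in classes
    return tag.name == "div" and "section" in classes


soup = BeautifulSoup(HTML, "html.parser")

# find_all walks the tree in document order, so containers come before
# the units nested inside them.
units = soup.find_all(is_unit)
unit_ids = [unit["id"] for unit in units]
print(unit_ids)
```

Note that each container still includes its nested units at this point, which is exactly the duplication described in the examples below.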

Example 1: If a section contains several classes and methods, I want each method on its own, each class without its methods (which I already have), and the section without its classes and methods (which I also already have) but with everything else in it, as there is a lot of additional content in there.

Example 2: If a method or a class does not appear inside any section, I still want to extract it as described above; it must not be forgotten.

Note: This doesn't make it any easier, but I would like to get all of them in a list that has the same order as the original documentation.

I tried it with BeautifulSoup, but I guess for this purpose I need to search "from leaf to root" to get the deepest elements first, which is (AFAIK) not supported by BeautifulSoup4.

At first I thought the problem was avoiding duplicates, but in fact that is not the main problem.

I would appreciate any hints.


Solution

  • It seems that this is not possible directly.

    What I did to solve the problem was to iterate over the elements (which I got using .descendants) again and again, replacing each nested element with a placeholder (using replace_with) so that the change is visible in its container.

    Since I had collected the elements with .descendants beforehand, the nested elements are still stored anyway.
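The placeholder idea can be sketched roughly as follows. This is a variant, not the exact code used above: instead of iterating over `.descendants` repeatedly, it collects the units once with `find_all` and walks that list backwards. Nested elements always come after their containers in document order, so the backwards walk reaches the deepest units first. The sample HTML, the helper names `is_unit` and `extract_units`, and the `[unit N]` placeholder format are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Made-up sample: a section containing a class with a method,
# plus a method outside of any section.
HTML = """
<div class="section">
  <p>Intro text</p>
  <dl class="class">
    <dt>class Foo</dt>
    <dd>
      <dl class="method"><dt>Foo.bar()</dt><dd>Does bar.</dd></dl>
    </dd>
  </dl>
</div>
<dl class="method"><dt>baz()</dt><dd>Top-level.</dd></dl>
"""


def is_unit(tag):
    """Hypothetical helper: True for <dl class="method|class"> and <div class="section">."""
    classes = tag.get("class", [])
    if tag.name == "dl":
        return "method" in classes or "class" in classes
    return tag.name == "div" and "section" in classes


def extract_units(html):
    """Return (kind, text) pairs in document order, nested units replaced by placeholders."""
    soup = BeautifulSoup(html, "html.parser")
    units = soup.find_all(is_unit)  # document order: containers first
    texts = {}
    # Walk backwards so every nested unit is handled before its container.
    for index in range(len(units) - 1, -1, -1):
        unit = units[index]
        classes = unit.get("class", [])
        if unit.name == "div":
            kind = "section"
        elif "class" in classes:
            kind = "class"
        else:
            kind = "method"
        # Nested units were already swapped for placeholders, so this text
        # contains only the unit's own content plus "[unit N]" markers.
        texts[index] = (kind, unit.get_text(" ", strip=True))
        unit.replace_with(soup.new_string("[unit %d]" % index))
    # Re-assemble the results in the original document order.
    return [texts[i] for i in range(len(units))]


result = extract_units(HTML)
for kind, text in result:
    print(kind, "|", text)
```

The placeholders keep a visible trace of where each nested unit used to be, and the index in `[unit N]` refers to the position of that unit in the returned list, so the original structure can still be reconstructed if needed.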