For a research project similar to this one, I want to extract all "documentation units" from the Python documentation. A documentation unit in the Python documentation can be (as html-meta tag):
These should be nested: a section contains several classes, which in turn contain several methods. In practice, however, the nesting is very irregular.
Example 1: If a section contains several classes and methods, I want each method on its own, each class without its methods (which I have already extracted), and the section without its classes and methods (both of which I already have at that point) but with the rest of its content, as there is a lot of additional material in there.
Example 2: If a method or a class does not appear inside any section, I still want it extracted as described above; it must not be forgotten.
Note: This doesn't make it easier, but I would like to get all of them in a list that has the same order as the original documentation.
I tried it with BeautifulSoup, but I guess for that purpose I need to search "from leaf to root" to get the deepest elements first, which is (AFAIK) not supported by BeautifulSoup4.
At first I thought the problem was avoiding duplicates, but in fact that is not the main problem.
I appreciate your hints.
It seems that this is not possible.
So what I did to solve this problem was to iterate over the elements (which I got using .descendants) again and again, replacing the nested elements with a placeholder (using replace_with) to make that change visible.
Since I had used .descendants beforehand, the nested elements were already stored anyway.
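
The placeholder idea above can be sketched roughly as follows. This is only an illustration under assumptions, not the exact code I used: the sample markup, the class names (`section`, `class`, `method`, `function`), the `is_unit` helper and the `[id]` placeholder format are all hypothetical, and the real documentation pages may look different. Instead of looping over .descendants repeatedly, the sketch collects the units once with find_all (which returns them in document order) and walks that list backwards, so every nested unit is handled before the unit that contains it.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical sample markup; the real pages may use different tags and classes.
SAMPLE_HTML = """
<div class="section" id="mod">
  <p>Intro text.</p>
  <dl class="class" id="A">
    <dt>class A</dt>
    <dd><p>About A.</p>
      <dl class="method" id="A.f"><dt>f()</dt><dd>About f.</dd></dl>
    </dd>
  </dl>
</div>
<dl class="function" id="g"><dt>g()</dt><dd>About g.</dd></dl>
"""

UNIT_CLASSES = {"class", "method", "function"}  # assumed class names


def is_unit(tag):
    """Return True if the tag looks like a documentation unit (assumed rules)."""
    classes = tag.get("class", [])
    if tag.name == "div" and "section" in classes:
        return True
    return tag.name == "dl" and bool(UNIT_CLASSES.intersection(classes))


def extract_units(html):
    """Return (id, text) pairs in document order, nested units cut out."""
    soup = BeautifulSoup(html, "html.parser")
    units = soup.find_all(is_unit)  # document order: parents before children
    results = [None] * len(units)
    # Walk backwards so that nested units are processed before their parents.
    for index in range(len(units) - 1, -1, -1):
        unit = units[index]
        unit_id = unit.get("id")
        # Leave a visible placeholder where the unit used to be.
        unit.replace_with(NavigableString(f"[{unit_id}]"))
        # The detached unit keeps its own content (with placeholders inside).
        text = " ".join(unit.get_text(" ").split())
        results[index] = (unit_id, text)
    return results


if __name__ == "__main__":
    for unit_id, text in extract_units(SAMPLE_HTML):
        print(unit_id, "->", text)
```

Because the results are written back by index, the list keeps the order of the original document even though the elements are processed from the end. The section's text then contains a placeholder where the class used to be, and the function outside of any section is collected as well.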