I'm trying to scrape all the tables that have 'Consolidated Schedule of Investments' in the title. The problem is that on each page the table sits in a different position, or in a different HTML structure, relative to that title. Example pages:
https://www.sec.gov/Archives/edgar/data/1287750/000128775023000021/arcc-20230331.htm
https://www.sec.gov/Archives/edgar/data/1633336/000095017023020540/ccap-20230331.htm
https://www.sec.gov/Archives/edgar/data/1534254/000153425423000008/cion-20230331.htm
This XPath selects the title element, but the next step is to select the table closest to it, whether that table is a sibling, an ancestor, or a descendant:
//span[contains(translate(., 'abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'), 'CONSOLIDATED SCHEDULE OF INVESTMENTS')]
I think you want to use the ancestor, descendant, and following axes, e.g.
//span[contains(translate(., 'abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'), 'CONSOLIDATED SCHEDULE OF INVESTMENTS')]/(ancestor::table[1] | descendant::table[1] | following::table[1])[1]
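The (ancestor::table[1] | descendant::table[1] | following::table[1])[1] step should cover the cases where the table is an ancestor of the span, a descendant of it, or comes after it (for example as a later sibling). The union is returned in document order, so the final [1] picks the nearest ancestor table if there is one, otherwise the first descendant table, otherwise the first table that follows the span.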
Note: a parenthesized expression used as a path step, /(...), is only supported in XPath 2.0 and later (i.e. not in 1.0), so I am not sure you can use it with your tooling. In the Python world there are at least two options for using the current version, XPath 3.1: ElementPath https://pypi.org/project/elementpath/ and SaxonC HE https://pypi.org/project/saxonche/.
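If neither of those engines is an option, the same precedence (nearest ancestor table, else first descendant table, else first following table) can be sketched in plain Python with the standard library's xml.etree.ElementTree. This swaps the XPath expression for a manual tree walk, so it is an approximation, not the XPath itself. It assumes the page parses as well-formed XML/XHTML; `closest_tables` and `local` are hypothetical helper names, and the small inline document is made up for illustration.

```python
import xml.etree.ElementTree as ET

PHRASE = "CONSOLIDATED SCHEDULE OF INVESTMENTS"


def local(tag):
    """Drop a namespace prefix such as '{http://www.w3.org/1999/xhtml}' from a tag."""
    return tag.rsplit("}", 1)[-1] if isinstance(tag, str) else ""


def closest_tables(root):
    """For each span whose text contains PHRASE (case-insensitively), return the
    nearest ancestor table, else the first descendant table, else the first
    table that follows the span in document order."""
    parent = {child: p for p in root.iter() for child in p}
    order = list(root.iter())  # every element, in document order
    results = []
    for i, el in enumerate(order):
        if local(el.tag) != "span":
            continue
        if PHRASE not in "".join(el.itertext()).upper():
            continue
        # 1. Walk upwards to the nearest enclosing table, if any.
        node = parent.get(el)
        while node is not None and local(node.tag) != "table":
            node = parent.get(node)
        if node is None:
            # 2./3. Descendants come directly after the span in document order,
            # before any following element, so one forward scan covers both.
            node = next((e for e in order[i + 1:] if local(e.tag) == "table"), None)
        if node is not None and node not in results:
            results.append(node)
    return results


# Made-up sample: one title inside a table, one title before a table.
doc = ET.fromstring(
    "<body>"
    "<table id='a'><tr><td><span>Consolidated Schedule of Investments</span></td></tr></table>"
    "<div><span>CONSOLIDATED SCHEDULE OF INVESTMENTS (continued)</span></div>"
    "<table id='b'><tr><td>x</td></tr></table>"
    "</body>"
)
ids = [t.get("id") for t in closest_tables(doc)]
print(ids)  # ['a', 'b']
```

The first span sits inside table a, so its ancestor wins; the second span has no enclosing table, so the forward scan finds table b. For pages that are not well-formed XML, a more tolerant HTML parser would be needed in place of ET.fromstring.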