
How to extract relative XPaths of a web page from a URL


I am working on a program that lists the elements in a web page together with their relative XPaths. Using Java and jsoup, I want to generate relative XPaths dynamically for every element in any given web page. A small, complete working utility would help a lot.

I want something like:

//*[@id="menu-item-13686"]/a

Sample output:

Element Or Node or component Name: xxxx AND Xpath = //*[@id="menu-item-13686"]/a

Thank you


Solution

  • You can start with the approach below, which uses jOOX to translate the CSS selectors generated by jsoup into XPath.

    Use jOOX 1.6.1 or later (compiled with Java 10), which fixes the issue tracked at https://github.com/jOOQ/jOOX/issues/158.

    The code snippet below selects every element and, for each one, prints the node name, the tag name, a CSS selector, and an XPath expression that uniquely selects that element.

    import java.io.IOException;

    import org.joox.selector.CSS2XPath;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class TestParser {

        public static void main(String[] args) {
            try {
                // Fetch and parse the page.
                Document doc = Jsoup.connect("https://theuserisdrunk.com/").get();

                // "*" matches every element in the document.
                Elements elements = doc.select("*");
                for (Element element : elements) {
                    // jsoup builds a unique CSS selector; jOOX translates it to XPath.
                    String css = element.cssSelector();
                    String path = CSS2XPath.css2xpath(css, true);

                    System.out.println("Node name : " + element.nodeName());
                    System.out.println("      Tag : " + element.tagName());
                    System.out.println("      CSS : " + css);
                    System.out.println("    XPath : " + path + "\n");
                }
            } catch (IOException e) {
                // Thrown when the page cannot be fetched.
                e.printStackTrace();
            }
        }
    }

    Sample Output:

    Node name : div
          Tag : div
          CSS : #mc-embedded-subscribe-form > div.clear:nth-child(4)
        XPath : //*[@id='mc-embedded-subscribe-form']/div[@class='clear' or starts-with(@class, 'clear ') or ' clear' = substring(@class, string-length(@class) - string-length(' clear') + 1) or contains(@class, ' clear ')][count(preceding-sibling::*) = 4 - 1]
    
    Node name : input
          Tag : input
          CSS : #mc-embedded-subscribe
        XPath : //*[@id='mc-embedded-subscribe']
    
    Node name : p
          Tag : p
          CSS : #mc_embed_signup > p.intern:nth-child(2)
        XPath : //*[@id='mc_embed_signup']/p[@class='intern' or starts-with(@class, 'intern ') or ' intern' = substring(@class, string-length(@class) - string-length(' intern') + 1) or contains(@class, ' intern ')][count(preceding-sibling::*) = 2 - 1]
    
    Node name : a
          Tag : a
          CSS : #mc_embed_signup > p.intern:nth-child(2) > a
        XPath : //*[@id='mc_embed_signup']/p[@class='intern' or starts-with(@class, 'intern ') or ' intern' = substring(@class, string-length(@class) - string-length(' intern') + 1) or contains(@class, ' intern ')][count(preceding-sibling::*) = 2 - 1]/a
    
    Node name : p
          Tag : p
          CSS : #mc_embed_signup > p.intern:nth-child(3)
        XPath : //*[@id='mc_embed_signup']/p[@class='intern' or starts-with(@class, 'intern ') or ' intern' = substring(@class, string-length(@class) - string-length(' intern') + 1) or contains(@class, ' intern ')][count(preceding-sibling::*) = 3 - 1]
    
    Node name : i
          Tag : i
          CSS : #mc_embed_signup > p.intern:nth-child(3) > i
        XPath : //*[@id='mc_embed_signup']/p[@class='intern' or starts-with(@class, 'intern ') or ' intern' = substring(@class, string-length(@class) - string-length(' intern') + 1) or contains(@class, ' intern ')][count(preceding-sibling::*) = 3 - 1]/i
    
    Node name : p
          Tag : p
          CSS : #mc_embed_signup > p.intern:nth-child(4)
        XPath : //*[@id='mc_embed_signup']/p[@class='intern' or starts-with(@class, 'intern ') or ' intern' = substring(@class, string-length(@class) - string-length(' intern') + 1) or contains(@class, ' intern ')][count(preceding-sibling::*) = 4 - 1]
    
    Node name : a
          Tag : a
          CSS : #mc_embed_signup > p.intern:nth-child(4) > a:nth-child(1)
        XPath : //*[@id='mc_embed_signup']/p[@class='intern' or starts-with(@class, 'intern ') or ' intern' = substring(@class, string-length(@class) - string-length(' intern') + 1) or contains(@class, ' intern ')][count(preceding-sibling::*) = 4 - 1]/a[count(preceding-sibling::*) = 1 - 1]
    
    Node name : a
          Tag : a
          CSS : #mc_embed_signup > p.intern:nth-child(4) > a:nth-child(2)
        XPath : //*[@id='mc_embed_signup']/p[@class='intern' or starts-with(@class, 'intern ') or ' intern' = substring(@class, string-length(@class) - string-length(' intern') + 1) or contains(@class, ' intern ')][count(preceding-sibling::*) = 4 - 1]/a[count(preceding-sibling::*) = 2 - 1]
    
    Node name : i
          Tag : i
          CSS : #mc_embed_signup > p.intern:nth-child(4) > i
        XPath : //*[@id='mc_embed_signup']/p[@class='intern' or starts-with(@class, 'intern ') or ' intern' = substring(@class, string-length(@class) - string-length(' intern') + 1) or contains(@class, ' intern ')][count(preceding-sibling::*) = 4 - 1]/i
    
    Node name : a
          Tag : a
          CSS : #mc_embed_signup > p.intern:nth-child(4) > a:nth-child(4)
        XPath : //*[@id='mc_embed_signup']/p[@class='intern' or starts-with(@class, 'intern ') or ' intern' = substring(@class, string-length(@class) - string-length(' intern') + 1) or contains(@class, ' intern ')][count(preceding-sibling::*) = 4 - 1]/a[count(preceding-sibling::*) = 4 - 1]
    
    Node name : a
          Tag : a
          CSS : #mc_embed_signup > p.intern:nth-child(4) > a:nth-child(5)
        XPath : //*[@id='mc_embed_signup']/p[@class='intern' or starts-with(@class, 'intern ') or ' intern' = substring(@class, string-length(@class) - string-length(' intern') + 1) or contains(@class, ' intern ')][count(preceding-sibling::*) = 4 - 1]/a[count(preceding-sibling::*) = 5 - 1]
    
    Node name : script
          Tag : script
          CSS : html > body > script:nth-child(3)
        XPath : //html/body/script[count(preceding-sibling::*) = 3 - 1]
    
    Node name : script
          Tag : script
          CSS : html > body > script:nth-child(4)
        XPath : //html/body/script[count(preceding-sibling::*) = 4 - 1]
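
The XPath strings that jOOX produces are correct but verbose. If a shorter form such as //*[@id="menu-item-13686"]/a is preferred, one option is to build the path by hand. The following is only a minimal sketch, not part of jsoup or jOOX: the class RelativeXPath and its methods are hypothetical names, it works on the JDK's org.w3c.dom types (a jsoup Document can be converted with jsoup's W3CDom helper), and it assumes that id values are unique and contain no double quotes. It walks up from each element to the nearest ancestor that has an id, then checks the result by evaluating it with javax.xml.xpath.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of jsoup or jOOX.
class RelativeXPath {

    // Walks up from el and anchors the path at the first ancestor-or-self with an id.
    // Falls back to an absolute path from the root when no id is found.
    static String xpathFor(Element el) {
        StringBuilder path = new StringBuilder();
        Node node = el;
        while (node instanceof Element) {
            Element cur = (Element) node;
            String id = cur.getAttribute("id"); // empty string when absent
            if (!id.isEmpty()) {
                // Assumption: ids are unique and contain no double quotes.
                return "//*[@id=\"" + id + "\"]" + path;
            }
            path.insert(0, "/" + cur.getTagName() + positionSuffix(cur));
            node = cur.getParentNode();
        }
        return path.toString();
    }

    // Returns "[n]" (1-based, among same-named siblings) only when it is needed.
    static String positionSuffix(Element el) {
        Node parent = el.getParentNode();
        if (parent == null) {
            return "";
        }
        int index = 0;
        int total = 0;
        NodeList kids = parent.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            Node kid = kids.item(i);
            if (kid instanceof Element && kid.getNodeName().equals(el.getNodeName())) {
                total++;
                if (kid.isSameNode(el)) {
                    index = total;
                }
            }
        }
        return total > 1 ? "[" + index + "]" : "";
    }

    // Counts how many nodes an XPath expression selects in the document.
    static int countMatches(Document doc, String xpath) {
        try {
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODESET);
            return hits.getLength();
        } catch (Exception e) {
            throw new IllegalArgumentException("Bad XPath: " + xpath, e);
        }
    }

    // Parses well-formed markup; real-world HTML should go through jsoup first.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body><ul>"
                + "<li id=\"menu-item-13686\"><a href=\"#\">Home</a></li>"
                + "<li><a href=\"#\">About</a></li>"
                + "</ul></body></html>");
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element el = (Element) all.item(i);
            String xpath = xpathFor(el);
            System.out.println(el.getTagName() + " -> " + xpath
                    + " (matches: " + countMatches(doc, xpath) + ")");
        }
    }
}
```

For the first link in the inline sample, this prints the path //*[@id="menu-item-13686"]/a, and the second link falls back to /html/body/ul/li[2]/a because none of its ancestors has an id. The match count is a quick way to confirm that each generated expression selects exactly one node.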