java, jsoup

Getting sub links of a URL using jsoup


Consider a URL like www.example.com. It may have plenty of links; some may be internal and others external. I want to get a list of all the sub-links only, not the sub-sub-links. E.g., if there are four links as follows:

1) www.example.com/images/main
2) www.example.com/data
3) www.example.com/users
4) www.example.com/admin/data

Then out of the four, only 2) and 3) are of use, as they are direct sub-links and not sub-sub-links (and so on). Is there a way to achieve this with jsoup? If it cannot be done with jsoup, can someone point me to another Java API? Also note that each link should belong to the parent URL that is initially sent (i.e. www.example.com).


Solution

  • If I understand correctly, a sub-link contains exactly one slash, so you can try counting the number of slashes, for example:

    import java.util.ArrayList;
    import java.util.List;

    List<String> list = new ArrayList<>();
    list.add("www.example.com/images/main");
    list.add("www.example.com/data");
    list.add("www.example.com/users");
    list.add("www.example.com/admin/data");

    for (String link : list) {
        //A link with exactly one slash is a direct sub-link
        if ((link.length() - link.replaceAll("[/]", "").length()) == 1) {
            System.out.println(link);
        }
    }
    

    link.length(): counts the number of characters in the link
    link.replaceAll("[/]", "").length(): counts the characters left after the slashes are removed, so the difference between the two is the number of slashes

    If the difference is equal to one, the link is a direct sub-link; otherwise it is not.
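
    To connect this back to jsoup, here is a minimal sketch that applies the same "one slash" idea to the links actually found on a page. The URL is just an example, and the way I normalize trailing slashes is an assumption about how you want to compare the links:

    import java.io.IOException;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class SubLinks {
        public static void main(String[] args) throws IOException {
            //Example site; replace with the parent URL you start from
            String base = "http://www.example.com";

            //Fetch and parse the page
            Document doc = Jsoup.connect(base).get();

            //Select every anchor that has an href attribute
            for (Element a : doc.select("a[href]")) {
                //absUrl resolves relative hrefs against the page URL
                String link = a.absUrl("href");

                //Keep only links that belong to the parent URL
                if (!link.startsWith(base + "/")) {
                    continue;
                }

                //Path after the base, without a trailing slash
                String path = link.substring(base.length() + 1).replaceAll("/+$", "");

                //A direct sub-link has no further slash in its path
                if (!path.isEmpty() && !path.contains("/")) {
                    System.out.println(link);
                }
            }
        }
    }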


    EDIT

    How will I scan the whole website for sub-links?

    One answer to this is the robots.txt file, i.e. the Robots Exclusion Standard: it lists many of a site's top-level paths (the ones crawlers are asked to skip), for example https://stackoverflow.com/robots.txt. So the idea is to read this file and extract the sub-links of the web site from it. Here is a piece of code that can help you:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;

    public static void main(String[] args) throws Exception {

        //Your web site
        String website = "https://stackoverflow.com";
        //We will read the URL https://stackoverflow.com/robots.txt
        URL url = new URL(website + "/robots.txt");

        //List of your sub-links
        List<String> list = new ArrayList<>();

        //Read the file with BufferedReader
        try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
            String subLink;

            //Loop through the file line by line
            while ((subLink = in.readLine()) != null) {

                //Keep the line if it is a Disallow rule for a one-level path
                //like /posts/ or /posts? (matches() checks the whole line)
                if (subLink.matches("Disallow: /\\w+[/?]")) {
                    list.add(website + "/" + subLink.replace("Disallow: /", ""));
                }
            }
        }

        //Print your result
        System.out.println(list);
    }
    

    This will show you:

    [https://stackoverflow.com/posts/, https://stackoverflow.com/posts?, https://stackoverflow.com/search/, https://stackoverflow.com/search?, https://stackoverflow.com/feeds/, https://stackoverflow.com/feeds?, https://stackoverflow.com/unanswered/, https://stackoverflow.com/unanswered?, https://stackoverflow.com/u/, https://stackoverflow.com/messages/, https://stackoverflow.com/ajax/, https://stackoverflow.com/plugins/]

    Here is a demo of the regex that I use.
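
    As a quick self-contained check (the sample lines are made up, but typical of what a robots.txt contains), this prints which lines the regex accepts:

    import java.util.List;

    public class RegexDemo {
        public static void main(String[] args) {
            //Sample robots.txt lines, illustrative only
            List<String> lines = List.of(
                    "User-agent: *",
                    "Disallow: /posts/",
                    "Disallow: /posts?",
                    "Disallow: /questions/*answertab=*",
                    "Allow: /");

            for (String line : lines) {
                //matches() requires the whole line to fit the pattern
                boolean match = line.matches("Disallow: /\\w+[/?]");
                System.out.println(match + "\t" + line);
            }
        }
    }

    Only /posts/ and /posts? pass, because the pattern requires a single path segment followed by a slash or a question mark.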

    Hope this can help you.