java html html-parsing jsoup href

Jsoup find the nearest href


I have a map of strings. At the moment I get the page body, split it into words with jsoup.getPageBody().split("[^a-zA-Z]+"), and then iterate over the words to check whether each one exists in my map, as shown below:

for (String word : jsoup.getPageBody().split("[^a-zA-Z]+")) {
    if (wordIsInMap(word.toLowerCase())) {
        // At this point the word is in the map of strings
    }
}

Inside the loop, I would like to get the hyperlink (href) closest to the matched word, where distance is measured in number of words. I couldn't find any examples like that in the jsoup documentation. How can I do that?

An example is for this page: http://en.wikipedia.org/wiki/2012_in_American_television

If the map contains the strings "race" and "crucial", then I want to get these two links:

http://en.wikipedia.org/wiki/Breeders%27_Cup_Classic

http://en.wikipedia.org/wiki/Fox_Broadcasting_Company


Solution

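  • Here is a very simple implementation that should get you started. It does not pick the closest link by number of words, though: it looks for a link in the node containing the text and its following siblings, then falls back to the preceding siblings. I'll leave the word-distance part up to you.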

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.nodes.Node;
    import org.jsoup.nodes.TextNode;
    
    import java.util.List;
    
    public class Program {
    
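    public static void main(String... args) throws Exception {
        String searchFor = "online and";
    
        Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/2012_in_American_television").get();
        // first() returns null when no element's own text contains the search string
        Element element = doc.getElementsContainingOwnText(searchFor).first();
        if (element == null) {
            System.out.println("No element contains '" + searchFor + "'");
            return;
        }
    
        // The text-node lookup is case-sensitive, so it can still come back null
        Node nodeWithText = getFirstNodeContainingText(element.childNodes(), searchFor);
        Element closestLink = (nodeWithText != null) ? getClosestLink(nodeWithText) : null;
    
        System.out.println("Link closest to '" + searchFor + "': "
                + (closestLink != null ? closestLink.attr("abs:href") : "none found"));
    }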
    
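    private static Element getClosestLink(Node node) {
        if (node == null) {
            return null;
        }
    
        // Look forward first: the node itself, then each following sibling
        for (Node current = node; current != null; current = current.nextSibling()) {
            if (current instanceof Element) {
                // getElementsByTag also matches the element itself if it is an <a>
                Element linkElem = ((Element) current).getElementsByTag("a").first();
                if (linkElem != null) {
                    return linkElem;
                }
            }
        }
    
        // No link ahead: look backward through the preceding siblings
        for (Node current = node.previousSibling(); current != null; current = current.previousSibling()) {
            if (current instanceof Element) {
                // last() picks the link nearest to the starting node
                Element linkElem = ((Element) current).getElementsByTag("a").last();
                if (linkElem != null) {
                    return linkElem;
                }
            }
        }
    
        return null;
    }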
    
    private static Node getFirstNodeContainingText(List<Node> nodes, String text) {
        for (Node node : nodes) {
            if (node instanceof TextNode) {
                String nodeText = ((TextNode) node).getWholeText();
                if (nodeText.contains(text)) {
                    return node;
                }
            }
        }
        return null;
    }
    

    }
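
For the word-distance part that the question asks about, one possible approach is to walk the whole body once, number every word in document order, and record the word range that each link covers. The sketch below is only an illustration under some assumptions: the class name NearestLinkByWords and the methods nearestHref and isLink are hypothetical, words are split with the same "[^a-zA-Z]+" pattern as in the question, only the first occurrence of the target word is considered, and a tie goes to the link that appears earlier in the document.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

import java.util.ArrayList;
import java.util.List;

public class NearestLinkByWords {

    // Hypothetical helper: true for <a> elements that carry an href attribute
    private static boolean isLink(Node node) {
        return node instanceof Element
                && ((Element) node).tagName().equals("a")
                && node.hasAttr("href");
    }

    // Returns the absolute href of the link closest (in words) to the first
    // occurrence of target, or null if the word or a link cannot be found.
    public static String nearestHref(Document doc, String target) {
        final List<String> words = new ArrayList<>();      // every word, in document order
        final List<Element> links = new ArrayList<>();     // every <a href>, in document order
        final List<Integer> linkStart = new ArrayList<>(); // index of the first word inside the link
        final List<Integer> linkEnd = new ArrayList<>();   // index one past the last word inside the link

        doc.body().traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    for (String w : ((TextNode) node).text().split("[^a-zA-Z]+")) {
                        if (!w.isEmpty()) {
                            words.add(w.toLowerCase());
                        }
                    }
                } else if (isLink(node)) {
                    links.add((Element) node);
                    linkStart.add(words.size());
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Links cannot nest in parsed HTML, so ends line up with starts
                if (isLink(node)) {
                    linkEnd.add(words.size());
                }
            }
        });

        int wordIndex = words.indexOf(target.toLowerCase());
        if (wordIndex < 0) {
            return null;
        }

        Element best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (int i = 0; i < links.size(); i++) {
            int start = linkStart.get(i);
            int end = linkEnd.get(i);
            int distance;
            if (wordIndex < start) {
                distance = start - wordIndex;       // word is before the link
            } else if (wordIndex >= end) {
                distance = wordIndex - end + 1;     // word is after the link
            } else {
                distance = 0;                       // word is part of the link text
            }
            if (distance < bestDistance) {
                bestDistance = distance;
                best = links.get(i);
            }
        }
        return best == null ? null : best.attr("abs:href");
    }
}
```

With a fetched page this could be called as NearestLinkByWords.nearestHref(Jsoup.connect(url).get(), "race"). Because words are counted per text node, a word that is split by inline markup (for example "foo<b>bar</b>") is counted as two words, which may or may not matter for your pages.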