jsoup, screen-scraping, wikipedia

Scraping plain text and hyperlinked text from Wikipedia with Jsoup


I have a Wikipedia element, shown below, that I want to scrape with Jsoup. I want to collect its contents into a list of strings, starting a new string at each <br>, if that makes sense. Right now I am looping over the Elements among the children of the <td>, which misses plain text such as CCCC and GGGG. Is there any way to capture plain text as well as hyperlinked text?

<td class="" style="" itemprop="">
<a href="/wiki/%E5%9C%8B%E5%AD%B8%E9%99%A2%E5%A4%A7%E5%AD%B8" title="AAAA">AAAA</a> 
<a href="/wiki/%E6%96%87%E5%AD%A6%E9%83%A8" title="BBBB">BBBB</a>
"CCCC"
<br>
"DDDD"
<a href="/wiki/%E5%A4%A7%E5%AD%A6%E9%99%A2" title="EEEE">EEEE</a>
<a href="/wiki/%E6%96%87%E5%AD%A6%E7%A0%94%E7%A9%B6%E7%A7%91" title="FFFF">FFFF</a> 
<br>
GGGG
</td>

The rendered Wikipedia page looks like this (bold marks hyperlinked text):

AAAABBBBCCCC

DDDDEEEEFFFF

GGGG

I want to create a list like this: [AAAABBBBCCCC, DDDDEEEEFFFF, GGGG]


Solution

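  • In this specific case, you can preprocess the HTML to make things easier for Jsoup. Replacing each <br> with </td><td> splits the cell into one <td> per visual line, so the text of each <td> (plain text and link text alike) becomes one list entry. Try this code: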

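        // Needs these imports: org.jsoup.Jsoup, org.jsoup.nodes.Document,
        // java.util.List, java.util.stream.Collectors
        String html = "<table><td class=\"\" style=\"\" itemprop=\"\">\n" +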
                "<a href=\"/wiki/%E5%9C%8B%E5%AD%B8%E9%99%A2%E5%A4%A7%E5%AD%B8\" title=\"AAAA\">AAAA</a> \n" +
                "<a href=\"/wiki/%E6%96%87%E5%AD%A6%E9%83%A8\" title=\"BBBB\">BBBB</a>\n" +
                "\"CCCC\"\n" +
                "<br>\n" +
                "\"DDDD\"\n" +
                "<a href=\"/wiki/%E5%A4%A7%E5%AD%A6%E9%99%A2\" title=\"EEEE\">EEEE</a>\n" +
                "<a href=\"/wiki/%E6%96%87%E5%AD%A6%E7%A0%94%E7%A9%B6%E7%A7%91\" title=\"FFFF\">FFFF</a> \n" +
                "<br>\n" +
                "GGGG\n" +
                "</td></table>";
    
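        // Turn each <br> into a cell boundary so every visual line gets its own <td>.
        html = html.replace("<br>", "</td><td>");
    
        Document doc = Jsoup.parse(html);
        // eachText() gives the combined text of each <td>: text nodes and link text together.
        List<String> result = doc.select("td").eachText()
                .stream()
                .map(r -> r.replace("\"", ""))  // drop the literal quotes around CCCC and DDDD
                .map(r -> r.replace(" ", ""))   // drop the spaces left between adjacent nodes
                .collect(Collectors.toList());
        System.out.println(result);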
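
The cleanup stage of that pipeline can be tried on its own, without Jsoup on the classpath. The sketch below is only an illustration: the class name `CellTextCleanup`, the helper `clean`, and the hard-coded input strings are hypothetical, with the inputs assumed to have roughly the shape that `eachText()` produces for the sample cell. It also swaps `replace(" ", "")` for `replaceAll("\\s+", "")` so that tabs and newlines are removed too, and drops entries that end up empty (for example, from two consecutive `<br>` tags).

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class CellTextCleanup {

    // Hypothetical helper: strips double quotes and all whitespace
    // from each string, then discards strings that became empty.
    static List<String> clean(List<String> cellTexts) {
        return cellTexts.stream()
                .map(s -> s.replace("\"", ""))      // remove literal quote characters
                .map(s -> s.replaceAll("\\s+", "")) // remove spaces, tabs, newlines
                .filter(s -> !s.isEmpty())          // skip lines with no content
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Assumed per-cell texts, one per visual line of the sample <td>.
        List<String> cellTexts = Arrays.asList(
                "AAAA BBBB \"CCCC\"",
                "\"DDDD\" EEEE FFFF",
                "GGGG");
        System.out.println(clean(cellTexts)); // prints [AAAABBBBCCCC, DDDDEEEEFFFF, GGGG]
    }
}
```

Removing every whitespace character suits the sample, where the spaces only come from the markup between nodes. For text that contains meaningful spaces, such as English phrases, this step would need to be narrower.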