java html css jsoup fileutils

What queries should I use for extracting symbols from an HTML page using Jsoup?


I am trying to extract the emojis listed on this site, http://www.i2symbol.com/emoticons/angry, using the Jsoup library for Java.

I have noticed in the HTML source of the page that every emoji is contained in a div whose id has the form symbol_N and which carries a data-symbols attribute.

Each symbol below is followed by its markup:

ヽ(ಠ_ಠ)ノ
<div id="symbol_0" data-symbols="&#x30FD;(&#x0CA0;_&#x0CA0;)&#x30CE;" contenteditable="true">&#x30FD;(&#x0CA0;_&#x0CA0;)&#x30CE;</div>
\(`0´)/
<div id="symbol_9" data-symbols="&#65340;&#40;&#65344;&#48;&#180;&#41;&#65295;" contenteditable="true">&#65340;&#40;&#65344;&#48;&#180;&#41;&#65295;</div>
(╯°□°)╯︵ ┻━┻
<div id="symbol_10" data-symbols="&#40;&#9583;&#176;&#9633;&#176;&#65289;&#9583;&#65077;&#32;&#9531;&#9473;&#9531;" contenteditable="true">&#40;&#9583;&#176;&#9633;&#176;&#65289;&#9583;&#65077;&#32;&#9531;&#9473;&#9531;</div>

So basically, the symbols are written as HTML numeric character references (hexadecimal in some cases, decimal in others). I looked at the selector syntax given here: https://jsoup.org/cookbook/extracting-data/selector-syntax. However, I am unable to craft an appropriate selector query to extract these symbols from the HTML page.

Also, there are about 27 symbols that need to be extracted from this page. How do I save these symbols to an external text file?

With help from @Dave, I was able to write this code. However, it prints the whole element, markup included. How can I use it to extract just

(╯°□°)╯︵ ┻━┻

from

<div id="symbol_10" data-symbols="&#40;&#9583;&#176;&#9633;&#176;&#65289;&#9583;&#65077;&#32;&#9531;&#9473;&#9531;" contenteditable="true">&#40;&#9583;&#176;&#9633;&#176;&#65289;&#9583;&#65077;&#32;&#9531;&#9473;&#9531;</div>

My Java code:

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class HTMLParserExample3 {

  public static void main(String[] args) {

    Document doc;
    try {
        doc = Jsoup.connect("http://www.i2symbol.com/emoticons/angry").get();

        Elements symbols= doc.select("div[^data-symbols]");
        for(Element symbol : symbols) {
            System.out.println("\nSymbol: " + symbol);
        }



    } catch (IOException e) {
        e.printStackTrace();
    }

  }

}

Solution

  • It looks like they all use an HTML5 data attribute (here, "data-symbols"), and according to the selector docs you can use the following to select elements by their data attributes:

    [^attr]: elements with an attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes

    With that in mind, give this a shot:

    Elements symbols= doc.select("div[^data-symbols]");
    

    As for writing it out to a file, if you want that file to be HTML you can try something like this.

    Update:

    JSoup has a way to do what you want listed here.

    If we apply that to your case and add it to what we previously had, we get:

    Elements symbols= doc.select("div[^data-symbols]");
    for (Element s: symbols) {
       String symbol= s.attr("data-symbols");
       System.out.println(symbol);
    }
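
The question also asks how to save the symbols to an external text file. A minimal sketch of that step, using only the standard library, could look like the following. The class name SymbolFileWriter, the method writeSymbols, and the file name symbols.txt are hypothetical and chosen for illustration; the sample strings stand in for the values returned by s.attr("data-symbols") in the loop above, written here as Unicode escapes so the source file does not depend on its own encoding.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of jsoup): saves one symbol per line as UTF-8 text.
public class SymbolFileWriter {

    // Creates the file or overwrites it if it already exists.
    // UTF-8 is set explicitly so the non-ASCII symbols do not depend on the platform default.
    public static void writeSymbols(Path target, List<String> symbols) throws IOException {
        Files.write(target, symbols, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Sample values standing in for the strings collected from the data-symbols attribute.
        List<String> symbols = Arrays.asList(
                "\u30FD(\u0CA0_\u0CA0)\u30CE",
                "(\u256F\u00B0\u25A1\u00B0)\u256F");
        Path target = Paths.get("symbols.txt"); // assumed output file name
        writeSymbols(target, symbols);
        System.out.println("Wrote " + symbols.size() + " symbols to " + target.toAbsolutePath());
    }
}
```

To combine this with the loop above, one option is to add each s.attr("data-symbols") value to a List<String> instead of printing it, then pass that list to writeSymbols once the loop has finished. Open the resulting file in an editor that reads UTF-8, otherwise the symbols may appear garbled.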