
How to extract a text value using jsoup


I am trying to extract the text that comes right after an element with CheckBoxIsChecked="t", using this selector:

p  > w|Sdt[CheckBoxIsChecked$='t']

but jsoup seems to ignore it, and I am not sure how to read the text that follows the element. I could do it in plain Java, but I am trying to keep it generic. Is there something like:

p  > w|Sdt[CheckBoxIsChecked$='t']  > first text after...

In this example the value I need is:
I Need this value since CheckBoxIsChecked is true

<p class="MsoNormal" style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
<w:Sdt CheckBox="t" CheckBoxIsChecked="t" >
    <span style="font-family:&quot;MS Gothic&quot;">y</span>
</w:Sdt> I Need this value since CheckBoxIsChecked is true 
<w:Sdt CheckBox="t" CheckBoxIsChecked="f" >
    <span style="font-family:&quot;MS Gothic&quot;">n</span>
</w:Sdt> This is not needed since CheckBoxIsChecked is false 
<w:Sdt CheckBox="t" CheckBoxIsChecked="f">
    <span style="font-family:&quot;MS Gothic&quot;">n</span>
</w:Sdt> This is not needed since CheckBoxIsChecked is false<o:p/>



Solution

  • Element.ownText() returns only the text that sits directly inside an element, so it does not see text that follows the closing tag. In your markup the wanted text comes after </w:Sdt>, which makes it a sibling node of w:Sdt inside the p element. Select the checked element and read its next sibling with Node.nextSibling(). Below is an example based on your markup:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Node;
    import org.jsoup.nodes.TextNode;

    public class Example {

        public static void main(String[] args) {
            String html = "<p class=\"MsoNormal\" style=\"margin-bottom:0in;margin-bottom:.0001pt;line-height:normal\">\n" +
                    "<w:Sdt CheckBox=\"t\" CheckBoxIsChecked=\"t\" >\n" +
                    "    <span style=\"font-family:&quot;MS Gothic&quot;\">y</span>\n" +
                    "</w:Sdt> I Need this value since CheckBoxIsChecked is true \n" +
                    "<w:Sdt CheckBox=\"t\" CheckBoxIsChecked=\"f\" >\n" +
                    "    <span style=\"font-family:&quot;MS Gothic&quot;\">n</span>\n" +
                    "</w:Sdt> This is not needed since CheckBoxIsChecked is false \n" +
                    "<w:Sdt CheckBox=\"t\" CheckBoxIsChecked=\"f\">\n" +
                    "    <span style=\"font-family:&quot;MS Gothic&quot;\">n</span>\n" +
                    "</w:Sdt> This is not needed since CheckBoxIsChecked is false<o:p/>";

            // The HTML parser lowercases tag and attribute names by default,
            // so the selector is written in lowercase.
            Document doc = Jsoup.parse(html);

            doc.select("p > w|sdt[checkboxischecked=t]").forEach(it -> {
                // The wanted text follows </w:Sdt>, so it is the next sibling node.
                Node next = it.nextSibling();
                if (next instanceof TextNode) {
                    System.out.println(((TextNode) next).text().trim());
                }
            });

        }
    }
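
    Since you mentioned wanting something generic, here is a minimal sketch of a reusable helper. It assumes the wanted text is always the text node directly following the matched element. The class name TrailingText and the method textAfter are hypothetical names made up for this illustration, not part of jsoup, and the shortened markup in main is only a stand-in for your document:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

import java.util.ArrayList;
import java.util.List;

class TrailingText {

    // Hypothetical helper: for every element matched by cssQuery, collect the
    // text node that immediately follows it (trimmed), skipping blank ones.
    static List<String> textAfter(Document doc, String cssQuery) {
        List<String> result = new ArrayList<>();
        for (Element el : doc.select(cssQuery)) {
            Node next = el.nextSibling();
            if (next instanceof TextNode) {
                String text = ((TextNode) next).text().trim();
                if (!text.isEmpty()) {
                    result.add(text);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the markup in the question.
        String html = "<p><w:Sdt CheckBoxIsChecked=\"t\"><span>y</span></w:Sdt> wanted "
                + "<w:Sdt CheckBoxIsChecked=\"f\"><span>n</span></w:Sdt> not wanted</p>";
        Document doc = Jsoup.parse(html);
        System.out.println(textAfter(doc, "p > w|sdt[checkboxischecked=t]"));
    }
}
```

    Passing the selector as a parameter keeps the sibling lookup in one place, so the same helper can be reused with other attribute values or tags. If the element is followed by another element instead of text, the helper simply returns nothing for that match.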
    
