
Extract the thread head and thread reply from a forum


I want to extract only the thread title and the users' views and replies from a forum page. With the code below, when I supply a URL it returns the text of the whole page. I only want the thread heading, which is defined in the title tag, and the user replies, which sit inside the div content tags. How can I extract just these, and how can I print the result to a txt file?

package extract;

import java.io.*;

import org.jsoup.*;
import org.jsoup.nodes.*;

public class TestJsoup
{
    public void SimpleParse()
    {
        try
        {
            Document doc = Jsoup.connect("url").get();

            doc.body().wrap("<div></div>");
            doc.body().wrap("<pre></pre>");

            String text = doc.text();

            // Converting nbsp entities
            text = text.replaceAll("\u00A0", " ");

            System.out.print(text);
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }

    public static void main(String[] args)
    {
        TestJsoup tjs = new TestJsoup();
        tjs.SimpleParse();
    }
}

Solution

  • Why do you wrap the body element in a div and a pre tag?

    The title element can be selected like this:

    Document doc = Jsoup.connect("url").get();
    
    Element titleElement = doc.select("title").first();
    String titleText = titleElement.text();
    
    // Or, as a shorter alternative to the two lines above:
    
    String titleText = doc.select("title").first().text();
    

    Div tags:

    // Document 'doc' as above
    
    Elements divTags = doc.select("div");
    
    
    for( Element element : divTags )
    {
        // Do something here, e.g. print each element
        System.out.println(element);
    
        // Or get the text of it
        String text = element.text();
    }
    

    The Jsoup Selector API documentation gives an overview of the whole selector syntax; it will help you find any kind of element you need.
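
The question also asks how to print the result to a txt file. Here is a minimal sketch of that step, assuming the title and the reply texts have already been extracted as plain strings with the selectors shown above. The class name `ThreadWriter`, the method names, the file name `thread.txt`, and the "Title:" / "Reply n:" line format are all made up for illustration; only the standard `java.nio.file` API is used, so the sample values in `main` stand in for the Jsoup results.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Jsoup: writes extracted thread text to a file.
public class ThreadWriter {

    // Builds the lines of the output file: the title first, then one line per reply.
    public static List<String> toLines(String title, List<String> replies) {
        List<String> lines = new ArrayList<>();
        lines.add("Title: " + title);
        for (int i = 0; i < replies.size(); i++) {
            lines.add("Reply " + (i + 1) + ": " + replies.get(i));
        }
        return lines;
    }

    // Writes the lines as UTF-8 text; the file is created, or overwritten if it exists.
    public static void writeThread(Path target, String title, List<String> replies)
            throws IOException {
        Files.write(target, toLines(title, replies), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Placeholder values; in the real program these would come from
        // titleElement.text() and element.text() as in the snippets above.
        String titleText = "Example thread";
        List<String> replyTexts = Arrays.asList("First reply", "Second reply");

        writeThread(Paths.get("thread.txt"), titleText, replyTexts);
    }
}
```

To connect this to the Jsoup code, collect `element.text()` from the loop into a `List<String>` and pass it to `writeThread` together with `titleText`. Selecting all `div` elements will still return more than the replies, so the selector would likely need to be narrowed to whatever class or id the forum uses for its reply containers.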