
How to only load text contents into JTextPane that are within the <body> tags?


Right now, I have a JTextPane in Java Swing that loads the contents of a file into the pane. However, it loads everything, including all the tags. I would like it to load only the text content. Is there a way to find the <body> tag and load just the portion between <body> and </body>?

Here is the code:

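import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import javax.swing.JTextPane;

public class LoadContent {

String path = "../WordProcessor_MadeInSwing/backups/testDir/cool_COPY3.rtf";

public void load(JTextPane jTextPane){
    // try-with-resources closes the reader even if read() fails
    try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
        jTextPane.read(reader, path);
    } catch (IOException ex) {
        // Also covers FileNotFoundException, which is a subclass of IOException
        ex.printStackTrace();
    }
}

}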

If my .rtf file contains the text "Here is a test", it will load as:

<html>
  <head>
    <style>
      <!--
        p.default {
          family:Dialog;
          size:3;
          bold:normal;
          italic:;
          foreground:#333333;
        }
      -->
    </style>
  </head>
  <body>
    <p class=default>
      <span style="color: #333333; font-size: 12pt; font-family: Dialog">
        Here is a test
      </span>
    </p>
  </body>
</html>

I only want it to load "Here is a test".


Solution

  • I would like it to only load the contents

    Then you need to parse out the contents first before displaying the text.

    Here is a simple example that displays the text between the <span> tags:

    import java.io.*;
    import java.net.*;
    import javax.swing.text.*;
    import javax.swing.text.html.*;
    
    class GetSpan
    {
        public static void main(String[] args)
            throws Exception
        {
            // Create a reader on the HTML content
    
            Reader reader = getReader( args[0] );
    
            // Parse the HTML
    
            EditorKit kit = new HTMLEditorKit();
            HTMLDocument doc = (HTMLDocument)kit.createDefaultDocument();
            doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
            kit.read(reader, doc, 0);
    
            // Find all the Span elements in the HTML document
    
            HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.SPAN);
    
            while (it.isValid())
            {
                int start = it.getStartOffset();
                int end = it.getEndOffset();
                String text = doc.getText(start, end - start);
                System.out.println(  text );
                it.next();
            }
        }
    
        // If 'uri' begins with "http" (which also covers "https"),
        // treat it as a URL; otherwise, treat it as a local file.
        static Reader getReader(String uri)
            throws IOException
        {
            // Retrieve from Internet.
            if (uri.startsWith("http"))
            {
                URLConnection conn = new URL(uri).openConnection();
                return new InputStreamReader(conn.getInputStream());
            }
            // Retrieve from file.
            else
            {
                return new FileReader(uri);
            }
        }
    }
    

    Just run the class with your file as the parameter.

    Edit:

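    I just noticed the question has been changed to look for text in the <body> tag instead of the <span> tag. doc.getIterator(HTML.Tag.BODY) does not return an iterator for the <body> tag. As far as I can tell, getIterator() only supports leaf tags such as <span>, not block tags such as <body>.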
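
    If you want to stay with the HTMLDocument approach, one possible workaround is to look up the <body> element in the document's element tree and read the text between its offsets. This is only a sketch under a few assumptions: the class name BodyElementText, the method name bodyText and the sample HTML string are made up for illustration, and the result is trimmed because the document model adds its own newlines.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.Element;
import javax.swing.text.StyleConstants;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical class name, used only for this sketch
public class BodyElementText
{
    public static String bodyText(Reader reader) throws Exception
    {
        // Parse the HTML into a document, as in the example above
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        kit.read(reader, doc, 0);

        // Each element stores its tag under StyleConstants.NameAttribute,
        // so search the tree for the element whose name is BODY
        Element body = doc.getElement(doc.getDefaultRootElement(),
            StyleConstants.NameAttribute, HTML.Tag.BODY);

        if (body == null)
            return "";

        int start = body.getStartOffset();
        // Clamp the end so the range never runs past the document text
        int end = Math.min(body.getEndOffset(), doc.getLength());
        return doc.getText(start, end - start).trim();
    }

    public static void main(String[] args) throws Exception
    {
        // Sample input, assumed for illustration
        String html = "<html><head><title>T</title></head>"
            + "<body><p>Here is a test</p></body></html>";
        System.out.println(bodyText(new StringReader(html)));
    }
}
```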

    So another option is to use a ParserCallback. The parser notifies the callback every time it finds a start tag or an end tag, and every time it finds text inside a tag.

    A basic example would be:

    import java.io.*;
    import java.net.*;
    import javax.swing.text.*;
    import javax.swing.text.html.parser.*;
    import javax.swing.text.html.*;
    
    public class ParserCallbackText extends HTMLEditorKit.ParserCallback
    {
        private boolean isBody = false;
    
        public void handleText(char[] data, int pos)
        {
            if (isBody)
                System.out.println( data );
        }
    
        public void handleStartTag(HTML.Tag tag, MutableAttributeSet a, int pos)
        {
            if (tag.equals(HTML.Tag.BODY))
            {
                isBody = true;
            }
        }
    
        public static void main(String[] args)
            throws Exception
        {
            Reader reader = getReader(args[0]);
            ParserCallbackText parser = new ParserCallbackText();
            new ParserDelegator().parse(reader, parser, true);
        }
    
        static Reader getReader(String uri)
            throws IOException
        {
            // Retrieve from Internet.
            if (uri.startsWith("http"))
            {
                URLConnection conn = new URL(uri).openConnection();
                return new InputStreamReader(conn.getInputStream());
            }
            // Retrieve from file.
            else
            {
                return new FileReader(uri);
            }
        }
    }
    

    The above example will ignore any text found in the <head> tag.
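
    To get from printing the text to actually loading it into the JTextPane, the same callback idea can collect the body text into a String instead. The following is a rough sketch, not a definitive implementation: the class name BodyTextExtractor, the method name extractBodyText and the sample HTML string are assumptions made for illustration. It also resets the flag on </body>, which the example above does not do.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical class name, used only for this sketch
public class BodyTextExtractor extends HTMLEditorKit.ParserCallback
{
    private final StringBuilder text = new StringBuilder();
    private boolean inBody = false;

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet a, int pos)
    {
        if (tag.equals(HTML.Tag.BODY))
            inBody = true;
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos)
    {
        if (tag.equals(HTML.Tag.BODY))
            inBody = false;
    }

    @Override
    public void handleText(char[] data, int pos)
    {
        if (!inBody)
            return;

        // Separate successive text runs with a newline
        if (text.length() > 0)
            text.append('\n');
        text.append(data);
    }

    // Parse the HTML from 'reader' and return only the text inside <body>
    public static String extractBodyText(Reader reader) throws IOException
    {
        BodyTextExtractor callback = new BodyTextExtractor();
        new ParserDelegator().parse(reader, callback, true);
        return callback.text.toString();
    }

    public static void main(String[] args) throws IOException
    {
        // Sample input, assumed for illustration
        String html = "<html><head><title>T</title></head>"
            + "<body><p class=default><span>Here is a test</span></p></body></html>";
        System.out.println(extractBodyText(new StringReader(html)));
    }
}
```

    In the load() method from the question, something like jTextPane.setText(BodyTextExtractor.extractBodyText(reader)) should then put only the body text into the pane, in place of the jTextPane.read(reader, path) call.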