Tags: java, html, jsoup, jakarta-mail

How can I remove HTML elements while retaining the formatting?


I am trying to use the JavaMail API to read the body of a message and store it in a text file if it has any content.

I am able to read the body of the message, but it comes with some HTML elements.

Below is the code I am using.

    Properties props = System.getProperties();
    props.setProperty("mail.store.protocol", "imaps");

    Session session = Session.getDefaultInstance(props, null);
    Store store = session.getStore("imaps");
    store.connect("hostname", "username", "password");
    String result = null;
    Folder inbox = store.getFolder("Inbox");
    inbox.open(Folder.READ_ONLY);
    // Fetch only the unread messages
    javax.mail.Message[] messages = inbox.search(new FlagTerm(new Flags(Flag.SEEN), false));
    for (Message message : messages) {
        // Jsoup.parse expects a String, not a Message. This assumes a single-part
        // text/html body; multipart messages have to be walked part by part.
        System.out.println(Jsoup.parse(message.getContent().toString()).text());
    }

How can I remove the HTML elements from the retrieved message?

Could anyone please help me solve this?


Solution

  • To remove all HTML tags from your mail, use jsoup's text() method.

    Example Code

    String htmlString = "<div class=\"WordSection1\"> <p class=\"MsoNormal\">Hi<br> <br> <br> <br> Data is written in this mail.<br> <br> <br> <br> <o:p></o:p></p> </div>";
    
    System.out.println(Jsoup.parse(htmlString).text());
    

    Output

    Hi Data is written in this mail.
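
Since the question asks to store the text only when it has content, here is a minimal sketch of that step using only the standard library. The class name BodyWriter and the method writeIfNotBlank are hypothetical, and the sketch assumes Java 11 or later (for String.isBlank and Files.writeString). In the real program the text argument would be the result of Jsoup.parse(html).text().

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class BodyWriter {
    /**
     * Writes the extracted text to the target file only when it has
     * non-whitespace content. Returns true if the file was written.
     */
    static boolean writeIfNotBlank(String text, Path target) throws IOException {
        if (text == null || text.isBlank()) {
            return false; // nothing worth storing
        }
        Files.writeString(target, text); // writes UTF-8 by default
        return true;
    }

    public static void main(String[] args) throws IOException {
        // A temp file stands in for the real output location (an assumption).
        Path target = Files.createTempFile("mail-body", ".txt");
        boolean written = writeIfNotBlank("Hi Data is written in this mail.", target);
        System.out.println(written + " -> " + target);
    }
}
```

Checking for blank text after stripping the tags, rather than before, also skips mails whose body consists only of markup.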
    

    If specific elements should produce line breaks, as they do in the rendered HTML, you can insert line breaks before those tags and then disable pretty-printing when you call jsoup's clean method.

    prettyPrint

    If disabled, the HTML output methods will not re-format the output, and the output will generally look like the input.

    Example Code

    String htmlString = "<div class=\"WordSection1\"> <p class=\"MsoNormal\">Hi<br> <br> <br> <br> Data is written in this mail.<br> <br> <br> <br> <o:p></o:p></p> </div>";
    
    htmlString = htmlString.replaceAll("<br>", System.getProperty("line.separator") + "<br>"); // do replacements for all tags that should result in line-breaks
    
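    Document.OutputSettings settings = new Document.OutputSettings();
    settings.prettyPrint(false); // to keep line-breaks
    
    // Whitelist was renamed to Safelist in jsoup 1.14.1 and removed in 1.15.1; use Safelist.none() on newer versions
    String cleanedSource = Jsoup.clean(htmlString, "", Whitelist.none(), settings);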
    
    System.out.println(cleanedSource);
    

    Output

     Hi
    
    
    
     Data is written in this mail.
    [... four more empty lines]
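
The replaceAll call in the example only matches the literal <br>. Below is a minimal sketch of the "do replacements for all tags" step, assuming that <br>, <br/>, <br /> and the closing </p> and </div> tags should each start a new line. The class name LineBreakMarker, the method markLineBreaks, and the tag list are all hypothetical; the sketch also uses "\n" rather than the platform line separator to keep the result predictable.

```java
import java.util.regex.Pattern;

final class LineBreakMarker {
    // Tags that should start a new line in the plain-text output.
    // This list is an assumption; extend it for your own mails.
    private static final Pattern BREAKING_TAGS =
            Pattern.compile("<br\\s*/?>|</p>|</div>", Pattern.CASE_INSENSITIVE);

    /** Inserts a newline before every matching tag; the tag itself is kept so jsoup can strip it later. */
    static String markLineBreaks(String html) {
        // $0 refers to the whole match, so each tag is re-emitted after the newline.
        return BREAKING_TAGS.matcher(html).replaceAll("\n$0");
    }

    public static void main(String[] args) {
        String html = "<p>Hi<br/>Data is written in this mail.<BR ></p>";
        System.out.println(markLineBreaks(html));
    }
}
```

The returned string can then be passed to Jsoup.clean with prettyPrint(false), as in the example above. Note that the pattern does not match tags that carry attributes, such as <br class="x">.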