Tags: java, html, regex, regexp-replace

How can I use a regex to remove HTML tags from a String?


I'm trying to use String.replaceAll(String regex, String replacement) to strip the markup out of an HTML document. My aim is to remove every tag, i.e. the <>-brackets and everything between them, by replacing each match with an empty String (""). For example, this:

<tr class='list odd'>
<td class="list" align="center">Do</td>
<td class="list" align="center">7.7.</td><td class="list" align="center">3 - 4</td>
<td class="list" align="center">---</td>
<td class="list" align="center"><s>Q1e14</s></td>
<td class="list" align="center">Arbeitsauftrag:</td>
<td class="list" align="center">entfällt</td></tr>

Should turn into this:

Do
7.7.
3 - 4
---
Q1e14
Arbeitsauftrag:
entfällt

I'm completely new to regex and after watching some tutorials I came up with these regexes:

\u003C([a-zA-Z0-9]|\s|\S)+
[\u003C]([a-zA-Z0-9]|\s|\W)+\u003E

I built them using https://regexr.com. They seem to more or less work there, but in my Java code both of them result in a StackOverflowError.

(Note that my IDE, IntelliJ, automatically makes each backslash into two backslashes. I think this is just adjusting the JavaScript regex to Java, but I could be wrong.)

TL;DR: How can I replace HTML tags (the <>-brackets and everything inside them) with an empty String using replaceAll, or some alternative if there is one?


Solution

  • Use a proper HTML parser such as Jsoup instead of string manipulation or regex. Jsoup provides a convenient API for extracting and manipulating HTML data and is intuitive to work with. Using Jsoup, your code could look like this:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.select.Elements;
    
    public class Example2 {
        public static void main(String[] args) {
            String html =
                      "<html>\n"
                    + "<head></head>"
                    + "<body>"
                    + "  <table>"
                    + "     <tr class='list odd'>\n"
                    + "        <td class=\"list\" align=\"center\">Do</td>\n"
                    + "        <td class=\"list\" align=\"center\">7.7.</td><td class=\"list\" align=\"center\">3 - 4</td>\n"
                    + "        <td class=\"list\" align=\"center\">---</td>\n"
                    + "        <td class=\"list\" align=\"center\"><s>Q1e14</s></td>\n"
                    + "        <td class=\"list\" align=\"center\">Arbeitsauftrag:</td>\n"
                    + "        <td class=\"list\" align=\"center\">entfällt</td></tr>\n"
                    + "   </table>"
                    + "</body>\n"
                    + "</html>";
    
            Document doc = Jsoup.parse(html);
    
            Elements tds = doc.select("td");
            tds.forEach(td -> System.out.println(td.text()));
        }
    }
    

    Output:

    Do
    7.7.
    3 - 4
    ---
    Q1e14
    Arbeitsauftrag:
    entfällt
    

    Maven dependency:

    <!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.15.2</version>
    </dependency>
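
If you would rather stay with a regex for small, well-formed snippets like the one in the question, a rough sketch follows. It is not a substitute for a parser: it does not handle comments, script blocks, or a `>` inside an attribute value. The pattern `<[^>]*>` uses a negated character class instead of an alternation inside a repeated group. As far as I know, Java's regex engine matches constructs like `(a|b)+` recursively, which is the likely cause of the StackOverflowError on longer input. The class name `StripTagsSketch` and the helper `textBetweenTags` are made up for this example. Each tag is replaced with a line break rather than an empty String, so that adjacent cells such as `7.7.` and `3 - 4` do not run together.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class StripTagsSketch {

    // '<', then any run of characters that are not '>', then '>'.
    // Assumption: no '>' appears inside attribute values or comments.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Hypothetical helper: returns the non-empty text pieces found between tags.
    public static List<String> textBetweenTags(String html) {
        // Replace every tag with a line break, then split on line breaks.
        String withoutTags = TAG.matcher(html).replaceAll("\n");
        return Arrays.stream(withoutTags.split("\n"))
                .map(String::trim)              // drop indentation around the text
                .filter(s -> !s.isEmpty())      // drop the gaps left by removed tags
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Shortened version of the table row from the question.
        String html =
                  "<tr class='list odd'>\n"
                + "<td class=\"list\" align=\"center\">Do</td>\n"
                + "<td class=\"list\" align=\"center\">7.7.</td><td class=\"list\" align=\"center\">3 - 4</td>\n"
                + "<td class=\"list\" align=\"center\"><s>Q1e14</s></td></tr>";

        // Prints Do, 7.7., 3 - 4 and Q1e14, one per line.
        textBetweenTags(html).forEach(System.out::println);
    }
}
```

For anything beyond such a simple fragment, the Jsoup approach above is the safer choice, because a parser also deals with entities, nested markup, and malformed HTML.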