java formatting jtextarea

Remove all formatting, numbered lists, bullet lists, spaces, tabs, etc. from a string


I'm making a custom document compare tool. I am comparing the content of a Word document to a web page's content. I parse the web page, extract just the text, and compare it to text that I copy from the Word document into a JTextArea.

All I want to do is compare the text and make sure that there are no spelling mistakes or missing words. When I parse the web page I don't get any formatting such as numbered or bulleted lists. My problem is that when I copy the contents of my Word document into my JTextArea, it preserves all the numbered lists, bulleted lists, etc.

What I want is to take the following text example:

Solution 1: Restart your network hardware

If Xbox LIVE performance seems slow, try restarting your network hardware. Here’s how:

  1. Turn off your Xbox 360 console and any network hardware (for example, your modem and router).
  2. Wait 30 seconds.
  3. Turn on your modem, and wait one minute.

And turn it into:

Solution 1: Restart your network hardware
If Xbox LIVE performance seems slow, try restarting your network hardware. Here’s how:
Turn off your Xbox 360 console and any network hardware (for example, your modem and router).
Wait 30 seconds.
Turn on your modem, and wait one minute.

I already have a regex to remove all the blank lines. I just don't know how I should handle removing the extra tabs, list styles, etc. Does anyone have any suggestions?


Solution

  • You can try the following heuristics:

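    • replace all tabs (and similar whitespace) with a space (see String.replaceAll())
    • replace all spaces-followed-by-number-followed-by-dot-at-the-beginning-of-line with a space (regex: |^ *\d+\.| -- caret, space, star, backslash-d, plus, backslash-dot). In a Java string literal each backslash must be doubled ("^ *\\d+\\."), and the pattern needs the multiline flag ((?m) or Pattern.MULTILINE) so that ^ matches at the start of every line and not only at the start of the whole string
    • replace every series of spaces (regex: |  +| -- space, space, plus) with one space (to remove excess) -- keep this as the last step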

    You can add any other replacement logic there if you encounter other patterns you don't want.

    Note: I added | around the regular expressions to make the leading spaces easier to see, but they are not part of the regex when you enter the code.
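
    The heuristics above could be sketched in Java roughly as follows. This is only a minimal sketch, not a definitive implementation: the class name TextNormalizer and the method normalize are made up for illustration, and treating "•" as the bullet character is an assumption about what Word puts on the clipboard. The marker pattern requires at least one space after the dot, so a line that starts with something like "3.5 GHz" is left alone. The blank-line removal at the end stands in for the regex the question already has.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the name is illustrative only.
final class TextNormalizer {

    // Start of line, optional spaces, then "digits + dot" or a bullet
    // (assumed to be U+2022), then at least one space.
    private static final Pattern LIST_MARKER =
            Pattern.compile("(?m)^ *(?:\\d+\\.|\u2022) +");

    static String normalize(String input) {
        // Normalize line endings so the later steps only deal with \n.
        String s = input.replaceAll("\\r\\n?", "\n");

        // Step 1: tabs become spaces.
        s = s.replace('\t', ' ');

        // Step 2: drop list markers at the start of each line.
        s = LIST_MARKER.matcher(s).replaceAll("");

        // Step 3: collapse runs of two or more spaces into one.
        s = s.replaceAll(" {2,}", " ");

        // Step 4: trim spaces at the start and end of every line.
        s = s.replaceAll("(?m)^ +| +$", "");

        // Step 5: remove blank lines, then trim the whole string.
        s = s.replaceAll("\\n{2,}", "\n");
        return s.trim();
    }
}
```

    You would run both the pasted text and the parsed web page text through the same method before comparing them, for example TextNormalizer.normalize(textArea.getText()), where textArea is whatever JTextArea holds the pasted content. Applying the same normalization to both sides matters more than getting every pattern perfect, because any heuristic that strips too much then strips it from both inputs.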