I'm making a custom document compare tool. I am comparing the content of a Word document to a webpage's content. I parse the webpage, extract just the text, and compare it to text that I copy from the Word document into a JTextArea.
All I want to do is compare the text to make sure that there are no spelling mistakes or missing words. When I parse the webpage I don't get any formatting such as numbered or bulleted lists. My problem is that when I copy the contents of my Word doc into my JTextArea, it preserves all the numbered lists, bulleted lists, etc.
What I want is to take the following text example:
Solution 1: Restart your network hardware
If Xbox LIVE performance seems slow, try restarting your network hardware. Here’s how:
- Turn off your Xbox 360 console and any network hardware (for example, your modem and router).
- Wait 30 seconds.
- Turn on your modem, and wait one minute.
And turn it into:
Solution 1: Restart your network hardware
If Xbox LIVE performance seems slow, try restarting your network hardware. Here’s how:
Turn off your Xbox 360 console and any network hardware (for example, your modem and router).
Wait 30 seconds.
Turn on your modem, and wait one minute.
I already have a regex to remove all the blank lines; I just don't know how I should handle removing the extra tabs, list styles, etc. Does anyone have any suggestions?
You can try the following heuristics, each applied with String.replaceAll():
- remove leading list numbers (|^ *\d*\.| -- caret, space, star, backslash-d, star, backslash-dot; in a Java string literal each backslash is doubled, i.e. "^ *\\d*\\.")
- replace runs of spaces (|  +| -- space-space-plus) with one space (to remove excess) -- keep this as the last step
- you can add any other replacement logic there if you encounter other patterns you don't want
Note: I added | around the regular expressions to make the leading spaces easier to see, but they are not part of the regex when you enter the code.
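For what it's worth, here is a minimal sketch of how those replacements could be combined in Java. The class name ListMarkerCleaner, the method name clean, and the set of bullet characters (-, * and the • that pasted bullets may show up as) are assumptions for illustration only; adjust them to whatever your pasted text actually contains. Two details differ slightly from the patterns above: the (?m) flag is added so that ^ matches at the start of every line rather than only at the start of the whole text, and the number pattern uses \d+ and requires whitespace after the dot, so that a line starting with something like "3.5 GB" is left alone.

```java
import java.util.regex.Pattern;

final class ListMarkerCleaner {
    // (?m) lets ^ match at the start of each line of the pasted text.
    // Matches optional indentation, then "digits + dot" or one bullet
    // character, then the whitespace that separates it from the item text.
    private static final Pattern LIST_MARKER =
            Pattern.compile("(?m)^[ \\t]*(?:\\d+\\.|[-*\\u2022])[ \\t]+");

    // Runs of two or more spaces or tabs anywhere in the text.
    private static final Pattern EXTRA_SPACES = Pattern.compile("[ \\t]{2,}");

    static String clean(String text) {
        String withoutMarkers = LIST_MARKER.matcher(text).replaceAll("");
        // Collapse excess whitespace as the last step.
        return EXTRA_SPACES.matcher(withoutMarkers).replaceAll(" ");
    }

    public static void main(String[] args) {
        // Hypothetical pasted text mixing numbered and bulleted items.
        String pasted = "Here's how:\n"
                + "\t1.  Turn off your console.\n"
                + "  - Wait 30 seconds.\n"
                + "\u2022\tTurn on your modem,  and wait one minute.\n";
        System.out.print(clean(pasted));
    }
}
```

You would call it on the text area's content before comparing, e.g. ListMarkerCleaner.clean(textArea.getText()). Be aware that a prose line that genuinely starts with "- " or "* " would lose that prefix too, so check the result against a few real documents.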