I need some leads on tools in PHP and/or Java (currently Spring + Hibernate) for hyphenating content. Some of the text lives in included files and some in a database. All of it is UTF-8 encoded, and I want to use soft hyphens because most browsers support them.
So this stored original:
<p> These words need hyphenation</p>
would turn up something like this
<p> The&shy;se wor&shy;ds need hyp&shy;he&shy;na&shy;tion</p>
in the source of the finally loaded web page.
Any ideas how to achieve this?
Suggestions for text editing tools that can hyphenate text inside HTML markup would also be welcome, for situations where no server-side code is in use and there are only plain HTML source files.
Also, I have yet to find a good source for hyphenation word lists.
CSS3 defines client-side hyphenation through the `hyphens` property.
This means that in supporting browsers¹, you only need to declare the language of your text (the `lang` attribute) and request automatic hyphenation (`hyphens: auto`), and the text will be hyphenated without any further work on your part. The trade-off is that the hyphenation points are determined by the browser's linguistic resources, not by you.
For manual control, you can place discretionary (soft) hyphens, U+00AD or `&shy;`, at every hyphenation point that you wish to allow and direct the browser to break only at those points with `hyphens: manual`.
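When the soft hyphens are inserted server-side, they should go only into the text, never into tag names, attribute values or character entities. The following is a minimal Java sketch of that idea, not a definitive implementation: the class name `SoftHyphenInserter` and the `hyphenateWord` callback are hypothetical, and the regex-based tokenizing is an assumption that only holds for simple, well-formed markup (it does not skip `<script>` or `<style>` content, for which a real HTML parser would be the safer choice).

```java
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: applies a word-level hyphenator to the text of an HTML
// fragment while leaving tags and character entities untouched.
class SoftHyphenInserter {
    // Group 1: a whole tag or an entity such as &nbsp; (copied verbatim).
    // Group 2: a run of letters, i.e. a word to hyphenate.
    private static final Pattern TOKEN =
            Pattern.compile("(<[^>]*>|&[#\\w]+;)|(\\p{L}+)");

    static String apply(String html, UnaryOperator<String> hyphenateWord) {
        Matcher m = TOKEN.matcher(html);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy whatever lies between tokens (spaces, digits, punctuation).
            out.append(html, last, m.start());
            if (m.group(2) != null) {
                out.append(hyphenateWord.apply(m.group(2))); // a word
            } else {
                out.append(m.group());                       // tag or entity
            }
            last = m.end();
        }
        out.append(html, last, html.length());
        return out.toString();
    }
}
```

The callback is expected to return the word with U+00AD characters inserted at the permitted break points; any hyphenation library can be plugged in there.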
In practice, the best way to find hyphenation points and insert discretionary hyphens is probably the venerable TeX hyphenation method (Liang's algorithm). Subword patterns carrying numeric priorities are matched against the word to hyphenate: an odd value allows a break at that position, an even value forbids one, and the highest value found at each position wins. These patterns are now widely used (including by OpenOffice, LibreOffice and Adobe InDesign) and are available for most languages.
Implementing the algorithm takes only a few lines of code. What's more, there are ready-made implementations in numerous languages: PHP implementations like phpHyphenator, Java implementations like TeXHyphenator-J or Hyphenation, and Java bindings such as jhyphen for the native libhyphen library.
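To give an idea of how little code the pattern matching needs, here is a minimal Java sketch of the TeX-style method. It is an illustration under stated assumptions rather than a replacement for the libraries above: the class name `LiangHyphenator` is hypothetical, the pattern list passed in the usage example is a tiny hand-picked set (a real application would load a complete pattern file for its language), and hyphenation exception lists are not handled.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical minimal implementation of TeX-style pattern hyphenation.
class LiangHyphenator {
    private final Map<String, int[]> patterns = new HashMap<>();
    private final int leftMin;   // minimum letters before the first break
    private final int rightMin;  // minimum letters after the last break

    LiangHyphenator(String[] rawPatterns, int leftMin, int rightMin) {
        this.leftMin = leftMin;
        this.rightMin = rightMin;
        for (String raw : rawPatterns) addPattern(raw);
    }

    // "hy3ph" becomes key "hyph" with levels [0, 0, 3, 0, 0];
    // levels[k] is the priority of the gap before the k-th pattern letter.
    private void addPattern(String raw) {
        StringBuilder letters = new StringBuilder();
        int[] levels = new int[raw.length() + 1];
        for (char c : raw.toCharArray()) {
            if (Character.isDigit(c)) levels[letters.length()] = c - '0';
            else letters.append(c);
        }
        patterns.put(letters.toString(),
                     Arrays.copyOf(levels, letters.length() + 1));
    }

    // Returns the word with U+00AD inserted at every permitted break point.
    String hyphenate(String word) {
        // Lowercase char by char so positions stay aligned with the input,
        // and mark the word boundaries with dots as TeX patterns expect.
        StringBuilder sb = new StringBuilder(".");
        for (int i = 0; i < word.length(); i++) {
            sb.append(Character.toLowerCase(word.charAt(i)));
        }
        String w = sb.append('.').toString();

        // levels[i] is the highest priority seen for the gap before w[i].
        int[] levels = new int[w.length() + 1];
        for (int start = 0; start < w.length(); start++) {
            for (int end = start + 1; end <= w.length(); end++) {
                int[] p = patterns.get(w.substring(start, end));
                if (p == null) continue;
                for (int k = 0; k < p.length; k++) {
                    levels[start + k] = Math.max(levels[start + k], p[k]);
                }
            }
        }

        // word[j] is w[j + 1], so the gap before word[j] is levels[j + 1].
        StringBuilder out = new StringBuilder();
        for (int j = 0; j < word.length(); j++) {
            boolean allowed = j >= leftMin && word.length() - j >= rightMin;
            if (allowed && levels[j + 1] % 2 == 1) out.append('\u00AD');
            out.append(word.charAt(j));
        }
        return out.toString();
    }
}
```

As a usage example, constructing `new LiangHyphenator(new String[] {"hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"}, 2, 3)` and calling `hyphenate("hyphenation")` produces `hy`, `phen` and `ation` separated by soft hyphens, because only the gaps before `p` and before `a` end up with odd priorities. The minimums of 2 and 3 letters mirror common English settings and are an assumption to adjust per language.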
¹ At the time of writing, Firefox, Safari and IE support automatic hyphenation; Chrome and Opera do not.