java html html-parsing jericho-html-parser

Remove non-ASCII whitespace (e.g. ideographic space) between the tag name and attribute in HTML

I want to trim the whitespace between a tag name and its attribute, for example with StringUtils.strip(), because my HTML contains whitespace that the following Jericho methods do not remove:

CharacterReference.decodeCollapseWhiteSpace(htmlFragment)
TextExtractor -Tag[] allTags = source.fullSequentialParse();

The first method removes ordinary spaces but not whitespace from other languages, such as the ideographic space (U+3000). For example, parsing the following input produces the errors shown after it:

html = "<a　　　href=\"test.html\"><font></font></a>";

StartTag a at (r1,c1,p0) rejected because the name contains an invalid character at position (r1,c3,p2)
Encountered possible StartTag at (r1,c1,p0) whose content does not match a registered StartTagType

Jericho also has a generateHTML method, but it requires supplying all the attribute values:

public static java.lang.String generateHTML(java.util.Map<java.lang.String,java.lang.String> attributesMap)

A full sequential parse does not recognise the non-ASCII whitespace either.

How can I remove non-ASCII whitespace ONLY between the tag name and the attribute? The same characters inside an attribute value are fine and must be kept, which is why I cannot simply call String.replaceAll() on the whole string.

Solution

You can still use String.replaceAll(), with a lookbehind that restricts the match to the position right after the tag name.

    String html = "<a　　　href=\"test.html\">　　　<font></font></a>";
    System.out.println(html.replaceAll("(?<=<\\w{1,100})[\\s\\u3000]+", " "));
    // -> <a href="test.html">　　　<font></font></a>

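This code replaces each run of whitespace, including \u3000 (ideographic space), with a single space. \u3000 is listed explicitly because \s does not match it by default. A run is replaced only when it is immediately preceded by <ELEMENT_NAME, and that preceding text is itself left untouched (see "zero-width positive lookbehind" in the Pattern class documentation). The lookbehind uses a bounded quantifier, so ELEMENT_NAME is limited to between 1 and 100 characters in this code.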
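To show that whitespace inside an attribute value survives, here is a minimal, self-contained sketch of the same pattern wrapped in a helper. The class name TagSpaceNormalizer and the method normalize are hypothetical names chosen for this illustration and are not part of Jericho. The sketch only touches the whitespace directly after the element name, not whitespace between later attributes.

```java
import java.util.regex.Pattern;

public class TagSpaceNormalizer {
    // Same pattern as above, compiled once: a run of whitespace (ASCII or U+3000)
    // that directly follows "<" plus 1 to 100 word characters.
    private static final Pattern AFTER_TAG_NAME =
            Pattern.compile("(?<=<\\w{1,100})[\\s\\u3000]+");

    // Hypothetical helper: collapses each matching run into a single ASCII space.
    static String normalize(String html) {
        return AFTER_TAG_NAME.matcher(html).replaceAll(" ");
    }

    public static void main(String[] args) {
        // U+3000 appears after the tag name, inside the attribute value, and in the text.
        String html = "<a\u3000\u3000href=\"te\u3000st.html\">\u3000<font></font></a>";
        // Only the run after "<a" is replaced; the other two U+3000 characters are kept.
        System.out.println(normalize(html));
    }
}
```

The normalized string can then be handed to the parser in place of the original input.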