
Remove other-language spaces (e.g. ideographic space) in HTML


I want to trim the whitespace between the tag name and the attribute using StringUtils.strip(), because my HTML contains spaces that the following Jericho methods do not remove:

  • CharacterReference.decodeCollapseWhiteSpace(htmlFragment)
  • TextExtractor -Tag[] allTags = source.fullSequentialParse();

The first method removes ordinary spaces but not spaces from other languages. This is the error I get, for example with:

html = "<a   href=\"test.html\"><font></font></a>";

StartTag a at (r1,c1,p0) rejected because the name contains an invalid character at position (r1,c3,p2)
Encountered possible StartTag at (r1,c1,p0) whose content does not match a registered StartTagType

There is also a generateHTML method in Jericho, but it requires supplying all the attribute values:

public static java.lang.String generateHTML(java.util.Map<java.lang.String,java.lang.String> attributesMap)

A full sequential parse does not recognise the other-language space.

How can I remove the other-language space ONLY between the tag name and the attribute? (An other-language space inside an attribute value is OK, which is why I cannot simply call String.replaceAll() on every such space.)


Solution

  • You can use String.replaceAll().

        String html = "<a   href=\"test.html\">   <font></font></a>";
        System.out.println(html.replaceAll("(?<=<\\w{1,100})[\\s\\u3000]+", " "));
        // -> <a href="test.html">   <font></font></a>
    

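    This code replaces each run of whitespace, including \u3000 (ideographic space), with a single space. The run must be immediately preceded by <ELEMENT_NAME, which is matched but not replaced (see "zero-width positive lookbehind" in the Pattern class documentation). Spaces elsewhere, such as inside attribute values or between tags, are left untouched. The length of ELEMENT_NAME is limited to between 1 and 100 characters in this code, since a Java lookbehind generally needs a bounded length.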
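To make the effect easier to check, here is a minimal sketch of the same regex wrapped in a helper. The class name TagSpaceNormalizer and the method normalize are made up for illustration and are not part of Jericho. The ideographic spaces are written as \u3000 escapes so they are visible in the source, and the title attribute is an assumed extra to show that a space inside an attribute value survives.

```java
import java.util.regex.Pattern;

public class TagSpaceNormalizer {
    // Run of whitespace (ASCII whitespace or U+3000) that directly follows "<tagName".
    // The {1,100} bound keeps the lookbehind at a finite maximum length.
    private static final Pattern SPACE_AFTER_TAG_NAME =
            Pattern.compile("(?<=<\\w{1,100})[\\s\\u3000]+");

    // Hypothetical helper: collapses the run after each tag name to one ASCII space.
    public static String normalize(String html) {
        return SPACE_AFTER_TAG_NAME.matcher(html).replaceAll(" ");
    }

    public static void main(String[] args) {
        // Ideographic spaces after the tag name, inside an attribute value, and between tags.
        String html = "<a\u3000 \u3000href=\"test.html\" title=\"a\u3000b\">\u3000<font></font></a>";
        // Only the run right after "<a" is collapsed; the other two U+3000 stay.
        System.out.println(normalize(html));
    }
}
```

The cleaned string can then be handed to the Jericho parser. Note that this only touches the whitespace directly after the tag name; an ideographic space between two attributes would need a different pattern.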