java html string line-breaks pdfbox

How to prevent CR/LF?


I am reading text from a PDF using PDFBox and apparently, at least on Windows, each line break comes through in the extracted text as the character &#10; (a line feed).

How can I prevent this line-break character from being concatenated to the string in the code below?

tokenizer =new StringTokenizer(Text,"\\.");
while(tokenizer.hasMoreTokens())
{
    String x= tokenizer.nextToken();
    flag=0;
    for(final String s :x.split(" ")) {
       if(flag==1)
          break;
       if(Keyword.toLowerCase().equals(s.toLowerCase()) && !"".equals(s)) {
          sum+=x+"."; // here I first need to check x for the line break "&#10;"
                      // before concatenating the String "x" to String "sum"
          flag=1;
       }
   }
}

Solution

  • You should discard the line separators when you split; e.g.

    for (final String s : x.split("\\s+")) {
    

    That makes the word separator a run of one or more whitespace characters, which includes \r and \n as well as spaces.

    (Using trim() won't work in all cases. Suppose that x contains "word\r\nword". You won't split between the two words, and s will be "word\r\nword" at some point. Then s.trim() won't remove the line break characters because they are not at the ends of the string.)
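
A minimal sketch of the difference between the two separators, assuming a sample string that contains "\r\n" between two words. The class and method names (WhitespaceSplitDemo, splitOnSpace, splitOnWhitespace) are hypothetical and only there for illustration.

```java
import java.util.Arrays;

class WhitespaceSplitDemo {
    // Splits on single space characters only, as in the question's code.
    static String[] splitOnSpace(String x) {
        return x.split(" ");
    }

    // Splits on runs of whitespace, so \r and \n also act as separators.
    static String[] splitOnWhitespace(String x) {
        return x.split("\\s+");
    }

    public static void main(String[] args) {
        String x = "first word\r\nword last";
        // Three tokens; the middle one still contains the line break.
        System.out.println(Arrays.toString(splitOnSpace(x)));
        // Four tokens; no token contains a line break.
        System.out.println(Arrays.toString(splitOnWhitespace(x)));
    }
}
```

Note that if x starts with whitespace, split("\\s+") returns an empty first token, so the !"".equals(s) check in the question's code is still worth keeping.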


    UPDATE

    I just spotted that you are actually appending x not s. So you also need to do something like this:

    sum += x.replaceAll("\\s+", " ") + ".";
    

    That does a bit more than you asked for. It replaces each whitespace sequence with a single space.
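
To show what that replacement does, here is a small sketch under the assumption that the input mixes "\r\n" and repeated spaces. The names CollapseWhitespaceDemo and collapse are hypothetical.

```java
class CollapseWhitespaceDemo {
    // Replaces every run of whitespace (spaces, tabs, \r, \n) with one space.
    static String collapse(String x) {
        return x.replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        // "\r\n" counts as a single whitespace run, so it becomes one space.
        System.out.println(collapse("line one\r\nline  two"));
    }
}
```

Leading or trailing whitespace is reduced to a single space rather than removed; add trim() if that matters for your output.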


    By the way, your code would be simpler and more efficient if you used a break to get out of the loop rather than messing around with a flag. (And Java has a boolean type ... for heaven's sake!)

       if (Keyword.toLowerCase().equals(s.toLowerCase()) && !"".equals(s)) {
           sum += ....
           break;
       }
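
Putting the pieces together, the whole loop might look roughly like the sketch below. This is one possible arrangement, not the only one: the class and method names (KeywordSentences, sentencesContaining) are hypothetical, a StringBuilder stands in for the sum string, and the tokenizer delimiter is assumed to be just "." (StringTokenizer treats each character of its delimiter string as a separator, so the "\\." in the question also splits on backslashes).

```java
import java.util.StringTokenizer;

class KeywordSentences {
    // Returns every sentence of text that contains keyword as a whole word
    // (case-insensitive), with line breaks collapsed to single spaces.
    static String sentencesContaining(String text, String keyword) {
        StringBuilder sum = new StringBuilder();
        StringTokenizer tokenizer = new StringTokenizer(text, ".");
        while (tokenizer.hasMoreTokens()) {
            String x = tokenizer.nextToken();
            for (String s : x.split("\\s+")) {
                // Skip the empty token produced by leading whitespace.
                if (!s.isEmpty() && s.equalsIgnoreCase(keyword)) {
                    sum.append(x.replaceAll("\\s+", " ")).append('.');
                    break; // one match per sentence is enough
                }
            }
        }
        return sum.toString();
    }

    public static void main(String[] args) {
        String text = "The cat\r\nsat down. A dog ran.\nThe CAT slept.";
        System.out.println(sentencesContaining(text, "cat"));
    }
}
```

The break replaces the flag entirely, and equalsIgnoreCase avoids lower-casing both strings on every comparison.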