c# regex word-wrap

Word Wrapping with Regular Expressions


EDIT FOR CLARITY - I know there are ways to do this in multiple steps, or by using LINQ or vanilla C# string manipulation. I am using a single regex call because I want practice with complex regex patterns. - END EDIT

I am trying to write a single regular expression that will perform word wrapping. It's extremely close to the desired output, but I can't quite get it to work.

Regex.Replace(text, @"(?<=^|\G)(.{1,20}(\s|$))", "$1\r\n", RegexOptions.Multiline)

This is correctly wrapping words for lines that are too long, but it's adding a line break when there already is one.

Input

"This string is really long. There are a lot of words in it.\r\nHere's another line in the string that's also very long."

Expected Output

"This string is \r\nreally long. There \r\nare a lot of words \r\nin it.\r\nHere's another line \r\nin the string that's \r\nalso very long."

Actual Output

"This string is \r\nreally long. There \r\nare a lot of words \r\nin it.\r\n\r\nHere's another line \r\nin the string that's \r\nalso very long.\r\n"

Note the double "\r\n" between sentences where the input already had a line break, and the extra "\r\n" appended at the end.

Perhaps there's a way to conditionally apply different replacement patterns? That is, if the match ends in "\r\n", use the replacement pattern "$1"; otherwise, use the replacement pattern "$1\r\n".
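
One way to apply a replacement conditionally in C# is the Regex.Replace overload that takes a MatchEvaluator delegate instead of a replacement string. The sketch below is an illustration only, reasoned through for the sample input rather than for every input. It keeps the pattern from the question and appends "\r\n" only when the match neither ends in a line break nor reaches the end of the input. The names ConditionalWrap and Wrap are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ConditionalWrap
{
    // Pattern copied from the question; only the replacement logic differs.
    private const string Pattern = @"(?<=^|\G)(.{1,20}(\s|$))";

    // Hypothetical helper name, used for illustration only.
    public static string Wrap(string text)
    {
        return Regex.Replace(
            text,
            Pattern,
            m =>
            {
                // The input already has a line break here, so keep the match as is.
                bool endsWithBreak = m.Value.EndsWith("\n", StringComparison.Ordinal);
                // The last chunk of the input needs no trailing line break.
                bool atEndOfText = m.Index + m.Length == text.Length;
                return (endsWithBreak || atEndOfText) ? m.Value : m.Value + "\r\n";
            },
            RegexOptions.Multiline);
    }

    public static void Main()
    {
        string input = "This string is really long. There are a lot of words in it.\r\n"
                     + "Here's another line in the string that's also very long.";
        // Show the line breaks as visible escape sequences when printing.
        Console.WriteLine(Wrap(input).Replace("\r\n", "\\r\\n"));
    }
}
```

For the sample input from the question, this should print the expected output, without the doubled "\r\n" and without the trailing "\r\n".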

As a starting point, I used a similar question about wrapping a string with no white space: "Regular expression to find unbroken text and insert space".


Solution

  • This was quick-tested in Perl.

    Edit: This regex simulates the word wrap (good or bad) used by MS Windows Notepad.exe.

     # MS-Windows  "Notepad.exe Word Wrap" simulation
     # ( N = 16 )
     # ============================
     # Find:     @"(?:((?>.{1,16}(?:(?<=[^\S\r\n])[^\S\r\n]?|(?=\r?\n)|$|[^\S\r\n]))|.{1,16})(?:\r?\n)?|(?:\r?\n|$))"
     # Replace:  @"$1\r\n"
     # Flags:    Global     
    
     # Note - Through trial and error, it appears Notepad accepts an extra whitespace character
     # (possibly in the N+1 position) to help alignment. This does not matter, because its viewport hides it.
     # No whitespace is trimmed, so the wrapped buffer could be reconstituted by inserting/detecting a
     # wrap-point code that is different from a linebreak.
     # This regex works on unwrapped source, but could probably be adjusted to produce/work on wrapped buffer text.
     # To reconstitute the source, all that is needed is to remove the wrap code, which is probably just an extra "\r".
    
     (?:
          # -- Words/Characters 
          (                       # (1 start)
               (?>                     # Atomic Group - Match words with valid breaks
                    .{1,16}                 #  1-N characters
                                            #  Followed by one of 4 prioritized, non-linebreak whitespace
                    (?:                     #  break types:
                         (?<= [^\S\r\n] )        # 1. - Behind a non-linebreak whitespace
                         [^\S\r\n]?              #      ( optionally accept an extra non-linebreak whitespace )
                      |  (?= \r? \n )            # 2. - Ahead a linebreak
                      |  $                       # 3. - EOS
                      |  [^\S\r\n]               # 4. - Accept an extra non-linebreak whitespace
                    )
               )                       # End atomic group
            |  
               .{1,16}                 # No valid word breaks, just break on the N'th character
          )                       # (1 end)
          (?: \r? \n )?           # Optional linebreak after Words/Characters
       |  
          # -- Or, Linebreak
          (?: \r? \n | $ )        # Stand alone linebreak or at EOS
     )
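
Since the question is about C#, here is a minimal sketch of how the pattern above might be called from .NET, with the width N passed in instead of hard-coded. The names NotepadWrap, BuildPattern and Wrap are hypothetical. Normalizing "\r\n" to "\n" before matching is an assumption of this sketch, made because '.' also matches '\r' and would otherwise pull it into group 1.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NotepadWrap
{
    // Hypothetical helper: rebuilds the pattern above for an arbitrary width.
    public static string BuildPattern(int width)
    {
        string run = ".{1," + width + "}";   // 1 to N characters
        return "(?:((?>" + run
             + @"(?:(?<=[^\S\r\n])[^\S\r\n]?|(?=\r?\n)|$|[^\S\r\n]))|"
             + run
             + @")(?:\r?\n)?|(?:\r?\n|$))";
    }

    public static string Wrap(string text, int width)
    {
        // Assumption of this sketch: work on "\n"-only text so that '.'
        // never captures a '\r' that belongs to an existing line break.
        string normalized = text.Replace("\r\n", "\n");
        // Group 1 holds the wrapped chunk; every match is closed with "\r\n".
        return Regex.Replace(normalized, BuildPattern(width), "$1\r\n");
    }

    public static void Main()
    {
        string wrapped = Wrap("aaaa bbbb cccc dddd eeee", 16);
        // Show the line breaks as visible escape sequences when printing.
        Console.WriteLine(wrapped.Replace("\r\n", "\\r\\n"));
    }
}
```

Because the last alternative can also match the empty string at the end of the input, the result can end with an extra line break, which the caller may want to trim.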
    

    Test Case: The wrap width N is 16. The output matches Notepad's, both at this width and over a variety of other widths.

     $/ = undef;
    
     $string1 = <DATA>;
    
     $string1 =~ s/(?:((?>.{1,16}(?:(?<=[^\S\r\n])[^\S\r\n]?|(?=\r?\n)|$|[^\S\r\n]))|.{1,16})(?:\r?\n)?|(?:\r?\n|$))/$1\r\n/g;
    
     print $string1;
    
     __DATA__
     hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
     bbbbbbbbbbbbbbbbEDIT FOR CLARITY - I                    know there are  ways to do this in   multiple steps, or using LINQ or vanilla C#
     string manipulation. 
    
     The reason I am using a single regex call, is because I wanted practice. with complex
     regex patterns. - END EDIT
     pppppppppppppppppppUf
    

    Output >>

     hhhhhhhhhhhhhhhh
     hhhhhhhhhhhhhhh
     bbbbbbbbbbbbbbbb
     EDIT FOR CLARITY 
     - I              
           know there 
     are  ways to do 
     this in   
     multiple steps, 
     or using LINQ or 
     vanilla C#
     string 
     manipulation. 
    
     The reason I am 
     using a single 
     regex call, is 
     because I wanted 
     practice. with 
     complex
     regex patterns. 
     - END EDIT
     pppppppppppppppp
     pppUf