string powershell .net-core replace truncate

Create a string synopsis


Given an unknown string of unknown size, e.g. a ScriptBlock expression or something like:

$Text = @'
LOREM IPSUM

Lorem Ipsum is simply dummy text of the printing and typesetting industry.
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.
It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.
It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
'@

I would like to summarize the string on a single line (replacing each run of consecutive whitespace with a single space) and truncate it to a specific $Length:

$Length = 32
$Text = $Text -Replace '\s+', ' '
if ($Text.Length -gt $Length) { $Text = $Text.SubString(0, $Length) }
$Text
LOREM IPSUM Lorem Ipsum is simpl

The issue is that for a large string this is inefficient: it replaces the whitespace in the whole $Text string, whereas I only need to replace the first few runs of whitespace, until I have a string of the required size ($Length = 32).
Swapping the -replace and Substring operations isn't an option either, as that would return a string shorter than required, or even just a single space for any $Text string that starts with something like 32 whitespace characters.
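
To make the problem with swapping concrete, here is a minimal sketch. The input (32 spaces followed by a title) and the variable names $Swapped and $Collapsed are only illustrative assumptions:

```powershell
$Length = 32
$Text   = (' ' * 32) + 'LOREM IPSUM'

# Truncate first, then collapse: the cut keeps nothing but the leading
# whitespace, which then collapses into a single space.
$Swapped = $Text.Substring(0, $Length) -replace '\s+', ' '

# Collapse first, then truncate: the title survives, but the whole
# string had to be processed by -replace.
$Collapsed = $Text -replace '\s+', ' '
if ($Collapsed.Length -gt $Length) { $Collapsed = $Collapsed.Substring(0, $Length) }

"[$Swapped]"      # [ ]
"[$Collapsed]"    # [ LOREM IPSUM]
```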

Question:
How can I efficiently merge the two operations (-replace and Substring) so that I don't replace more whitespace than necessary and still get a string of the required length (in case the $Text string is longer than the required length)?


Update

I think I am close by using a MatchEvaluator Delegate:

$Length = 8
$TotalSpaces = 0
$Delegate = {
    if ($Args[0].Index - $TotalSpaces -gt $Length) {
        '{break}'
        ([Ref]$TotalSpaces).Value = [int]::MaxValue
    }
    else { ([Ref]$TotalSpaces).Value += $Args[0].Value.Length }
}
[regex]::Replace('test 0 1 2 3 4 5 6 7 8 9', '\s+', $Delegate)
test01234{break}56789

Now the question is: how can I break out of the regex processing at the {break}?
Note that for performance reasons I really want to break out, not substitute each remaining <regular-expression> match with the match itself (which would merely make it look like the processing stopped).
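
One possible way to stop early without a MatchEvaluator is to walk the matches one at a time with [regex]::Match and Match.NextMatch(). As far as I know these search only as far as the next match, so nothing past the cut-off is scanned. This is a rough sketch under that assumption, not a measured solution; the function name Get-LazySynopsis is hypothetical:

```powershell
function Get-LazySynopsis {
    param([string]$Text, [int]$Length = 32)

    $sb    = [System.Text.StringBuilder]::new($Length)
    $pos   = 0                               # start of the text not yet copied
    $match = [regex]::Match($Text, '\s+')    # finds only the first whitespace run

    while ($match.Success -and $sb.Length -lt $Length) {
        # Copy the text in front of this run, but never more than still fits.
        $take = [Math]::Min($match.Index - $pos, $Length - $sb.Length)
        $null = $sb.Append($Text, $pos, $take)

        # Replace the whole run with a single space if there is room left.
        if ($sb.Length -lt $Length) { $null = $sb.Append(' ') }

        $pos   = $match.Index + $match.Length
        $match = $match.NextMatch()          # the search resumes only when asked
    }

    # No more whitespace runs: copy as much of the tail as still fits.
    if ($sb.Length -lt $Length -and $pos -lt $Text.Length) {
        $take = [Math]::Min($Length - $sb.Length, $Text.Length - $pos)
        $null = $sb.Append($Text, $pos, $take)
    }

    $sb.ToString()
}

Get-LazySynopsis -Text 'test 0 1 2 3 4 5 6 7 8 9' -Length 8    # test 0 1
```

The loop ends as soon as the builder is full, so the remaining matches are never requested.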


Solution

  • A more manual approach is perhaps faster than trying to do it with regex, although it takes a lot more code.

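    $Text = @'
    LOREM IPSUM
    Lorem   Ipsum is
       simply dummy    text
    '@
    
    $Length = 32
    $sb = [System.Text.StringBuilder]::new($Length)
    # initialize explicitly so a stale value from an earlier run cannot leak in
    $prevSpace = $false
    
    foreach ($char in $Text.GetEnumerator()) {
        # stop as soon as the synopsis is full; the rest of $Text is never examined
        if ($sb.Length -eq $Length) {
            break
        }
    
        if ([char]::IsWhiteSpace($char)) {
            # emit one space for the first character of a whitespace run, skip the rest
            if (-not $prevSpace) {
                $null = $sb.Append(' ')
            }
    
            $prevSpace = $true
            continue
        }
    
        $null = $sb.Append($char)
        $prevSpace = $false
    }
    
    $sb.ToString()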
    

    A very similar approach using String.Create might be even faster, but it would need to be precompiled or added via Add-Type. You can find an example here.
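
The Add-Type route could look roughly like the sketch below. Note that it does not use String.Create: it swaps in a plain char[] buffer, because the final length is not known up front and this form should also compile on Windows PowerShell. The class name SynopsisHelper and the method Collapse are hypothetical and not taken from the linked example.

```powershell
Add-Type -TypeDefinition @'
using System;

// Hypothetical helper: the same loop as the StringBuilder version, compiled as C#.
public static class SynopsisHelper
{
    public static string Collapse(string text, int length)
    {
        if (string.IsNullOrEmpty(text) || length <= 0) { return string.Empty; }

        // The result can never be longer than the input or the requested length.
        char[] buffer = new char[Math.Min(length, text.Length)];
        int count = 0;
        bool prevSpace = false;

        foreach (char c in text)
        {
            if (count == buffer.Length) { break; }           // synopsis is full, stop scanning

            if (char.IsWhiteSpace(c))
            {
                if (!prevSpace) { buffer[count++] = ' '; }   // first character of a run
                prevSpace = true;
            }
            else
            {
                buffer[count++] = c;
                prevSpace = false;
            }
        }

        return new string(buffer, 0, count);
    }
}
'@

[SynopsisHelper]::Collapse("LOREM IPSUM`nLorem   Ipsum is`n   simply dummy    text", 32)
```

Whether this beats the pure PowerShell loop enough to justify the compile step would have to be measured on the actual input.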