Tags: powershell, search, get-childitem, runspace, select-string

Using PowerShell runspaces to search a large number of files (.XML)


I have a script that looks for a regex, such as an address or phone number, inside a large number of files. The script I currently have runs as a job and works, but very slowly.

My current Start-Job approach works as expected, albeit slowly. I'm looking for ways to speed it up and return results more quickly, if at all possible.

After browsing around for help, I have ventured into the world of runspaces in PowerShell. Below is the code I have mashed together with my limited understanding of how runspaces are used.

My question is about how runspaces can be used so that a Get-ChildItem request running in parallel does not scan the same file in multiple runspaces, if this is even possible.

I created 20,000 files containing junk and manually edited 2 of them to contain the word "KETCHUP!".

10k of the files are .xml and 10k are .txt.

I'm trying not to use the PowerShell 7 -Parallel parameter, as I would like to hand my script/GUI to other members of staff who are not in IT and will not have anything newer than the ISE installed.

powershell searching for a phrase in a large amount of files fast

$Finished.text = 'Working.....' 

#Get list of files to search through
$path = "C:\intel\spam"
Push-Location $path
$FILES = Get-ChildItem -filter *.XML -File

### 5 Runspace limit
$RunspacePool = [RunspaceFactory]::CreateRunspacePool(1,5)
$RunspacePool.ApartmentState = "MTA"
$RunspacePool.Open()
$runspaces = @()

# Setup scriptblock
$scriptblock = {
    Param (
        [object]$files
    )
    foreach ($file in $files) {
        $test = Select-String -Path $file -Pattern 'KETCHUP!' -List | Select-Object FileName, Path

        if ($test) {
            Add-Content -Path 'C:\intel\matches.txt' -Value $test.Filename
        }
    }
}


Write-Output "Starting search..."

$runspace = [PowerShell]::Create()
[void]$runspace.AddScript($scriptblock)
[void]$runspace.AddArgument($FILES) # <-- Send files to be searched
$runspace.RunspacePool = $RunspacePool

$AsyncObject = $runspace.BeginInvoke()

# Wait for runspaces to complete
while ($runspaces.Status.IsCompleted -notcontains $true) {}

# Cleanup runspaces 
foreach ($runspace in $runspaces ) { 
    $runspace.Pipe.EndInvoke($runspace.Status) 
    $runspace.Pipe.Dispose()
}

# Cleanup runspace pool
$RunspacePool.Close() 
$RunspacePool.Dispose()


  $Data = $runspace.EndInvoke($AsyncObject)

Pop-Location 


Solution

  • My question is about how runspaces can be used so that a Get-ChildItem request running in parallel does not scan the same file in multiple runspaces, if this is even possible.

    It really comes down to logic: split the list of files to search into chunks, which is totally doable. It works more or less like this. Imagine you have 8 files and a hypothetical 2-core CPU:

    [File1] [File2] [File3] [File4] [File5] [File6] [File7] [File8]     <- All files from Get-ChildItem
    

    After determining the chunk_size (which would be 4 in this hypothetical scenario since 8 files divided by 2 cores is 4), the code would divide these files into chunks:

    Chunk 1: [File1] [File2] [File3] [File4]
    Chunk 2: [File5] [File6] [File7] [File8]
    

    This division would be stored in the $file_chunks list:

    $file_chunks:
    Index 0: [File1] [File2] [File3] [File4]
    Index 1: [File5] [File6] [File7] [File8]
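
    The chunking step described above can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the file names are placeholders, and $workers is a hypothetical variable standing in for the 2 cores of the example scenario.

```powershell
# Hypothetical input: 8 placeholder file names and 2 workers
$file_list = [System.Collections.Generic.List[string]]::new()
1..8 | ForEach-Object { $file_list.Add("File$_") }
$workers = 2

# Files per chunk, rounded up so that no file is left over
$chunk_size = [int][Math]::Ceiling([double]$file_list.Count / $workers)

$file_chunks = [System.Collections.Generic.List[string[]]]::new()
for ($start = 0; $start -lt $file_list.Count; $start += $chunk_size)
{
    # The last chunk may be shorter when the files do not divide evenly
    $count = [Math]::Min($chunk_size, $file_list.Count - $start)
    $file_chunks.Add($file_list.GetRange($start, $count).ToArray())
}

# Show which files landed in which chunk
for ($i = 0; $i -lt $file_chunks.Count; $i++)
{
    "Index ${i}: " + ($file_chunks[$i] -join ' ')
}
```

    Because each chunk is a contiguous, non-overlapping range of the list, every file ends up in exactly one chunk.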
    

    Now, when parallel processing begins, each runspace (ideally one per CPU core) picks up a chunk:

    CPU Core 1 (Runspace 1): Processing [File1] [File2] [File3] [File4]
    CPU Core 2 (Runspace 2): Processing [File5] [File6] [File7] [File8]
    

    Each runspace works on its own subset of files, so no file is scanned twice and the chunks are processed in parallel.
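
    To make the dispatch step concrete, here is a minimal sketch of handing one chunk to each runspace. The scriptblock does placeholder work (it only reports which files it received), and the names $pool, $jobs, Pipe and Handle are chosen for illustration only.

```powershell
# Two hypothetical chunks standing in for the real file lists
$file_chunks = [System.Collections.Generic.List[string[]]]::new()
$file_chunks.Add([string[]]@('File1', 'File2', 'File3', 'File4'))
$file_chunks.Add([string[]]@('File5', 'File6', 'File7', 'File8'))

# One runspace per chunk at most
$pool = [runspacefactory]::CreateRunspacePool(1, $file_chunks.Count)
$pool.Open()

# Placeholder work: each runspace only reports which files it was handed
$scriptblock = {
    Param([string[]]$files)
    "Processing " + ($files -join ' ')
}

# Start one asynchronous invocation per chunk
$jobs = foreach ($chunk in $file_chunks)
{
    $ps = [powershell]::Create().AddScript($scriptblock).AddArgument($chunk)
    $ps.RunspacePool = $pool
    [pscustomobject]@{ Pipe = $ps; Handle = $ps.BeginInvoke() }
}

# EndInvoke blocks until the given invocation has finished
$output = foreach ($job in $jobs)
{
    $job.Pipe.EndInvoke($job.Handle)
    $job.Pipe.Dispose()
}

$pool.Close()
$pool.Dispose()

$output
```

    Since each invocation receives only its own chunk as an argument, the runspaces never see each other's files.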

    With that said, you can build a more robust solution, such as a function that you can reuse in a friendlier manner :)

    function Search-Files {
        Param(
            [Parameter()]
            [string]$Path,
    
            [Parameter()]
            [string[]]$Filter = @('*.txt', '*.xml'),
    
            [Parameter()]
            [string]$Pattern,
    
            [Parameter()]
            [switch]$Recurse
        )
    
        $regex = [regex]::new($pattern, [System.Text.RegularExpressions.RegexOptions]::Compiled)
        $file_list = [System.Collections.Generic.List[string]]::new()
    
        $searchOption = if ($Recurse) { [System.IO.SearchOption]::AllDirectories } else { [System.IO.SearchOption]::TopDirectoryOnly }
        foreach ($find in $Filter) 
        {
            try 
            {
                $file_list.AddRange([System.IO.Directory]::EnumerateFiles($Path, $find, $searchOption))
            } 
            catch 
            {
                Write-Warning "An error occurred while fetching files with filter ${find}: $_"
            }
        }
    
    
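        $file_count = $file_list.Count
        if ($file_count -eq 0)
        {
            # Nothing to search; also avoids a zero-sized pool and a division by zero below
            return
        }
    
        $cpu_count = [Environment]::ProcessorCount
        $optimal_runspaces = [Math]::Min($cpu_count, $file_count)
        $file_chunks = [System.Collections.Generic.List[string[]]]::new($optimal_runspaces)
    
        $runspace_pool = [runspacefactory]::CreateRunspacePool(1, $optimal_runspaces)
        $runspace_pool.Open()
    
        # Split the list into contiguous, non-overlapping chunks. Stepping by $chunk_size
        # and stopping at $file_count keeps every index in range, including when the
        # files do not divide evenly across the runspaces.
        $chunk_size = [int][Math]::Ceiling([double]$file_count / $optimal_runspaces)
        for ($start = 0; $start -lt $file_count; $start += $chunk_size) 
        {
            $count = [Math]::Min($chunk_size, $file_count - $start)
            $file_chunks.Add($file_list.GetRange($start, $count).ToArray())
        }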
    
        $scriptblock = {
            Param($files, $regex)
    
            $results = [System.Collections.Generic.List[string]]::new()
            foreach ($file in $files) 
            {
                try 
                {
                    $reader = [System.IO.File]::OpenText($file)
                    while ($reader.Peek() -ge 0) 
                    {
                        $line = $reader.ReadLine()
                        if ($regex.IsMatch($line)) 
                        {
                            $results.Add($file)
                            break
                        }
                    }
                } 
                finally 
                {
                    if ($reader) 
                    {
                        $reader.Dispose()
                    }
                }
            }
            return $results
        }
    
        $runspaces = @{}
        foreach ($chunk in $file_chunks) 
        {
            $runspace = [powershell]::Create().AddScript($scriptblock).AddArgument($chunk).AddArgument($regex)
            $runspace.RunspacePool = $runspace_pool
            $runspaces[$runspace] = $runspace.BeginInvoke()
        }
    
        # Wait for all runspaces to complete
        while ($runspaces.Values | Where-Object { -not $_.IsCompleted }) 
        {
            Start-Sleep -Milliseconds 100
        }
    
        $all_results = foreach ($runspace in $runspaces.GetEnumerator()) 
        {
            $runspace.Key.EndInvoke($runspace.Value)
            # Release the PowerShell instance once its results are collected
            $runspace.Key.Dispose()
        }
    
        $runspace_pool.Close()
        $runspace_pool.Dispose()
    
        # The foreach above already enumerates each EndInvoke result, so this is a single flat list of matched file paths
        return $all_results
    }
    
    # Usage
    $path = "C:\intel\spam"
    $filter = "*.xml"
    $pattern = "KETCHUP!"
    $results = Search-Files -Path $path -Filter $filter -Pattern $pattern
    
    $results | ForEach-Object { Add-Content -Path 'C:\intel\matches.txt' -Value $_ }
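
    One thing to keep in mind with the usage above: because the function builds the pattern with [regex]::new, the value passed to -Pattern is treated as a regular expression, not as literal text. For a literal phrase that contains regex metacharacters, escaping it first with [regex]::Escape avoids surprises. A small sketch with made-up sample strings:

```powershell
# Hypothetical literal phrase containing regex metacharacters
$phrase = '(555) 123-4567'

# Unescaped, the parentheses form a capture group and are not matched literally
$raw = [regex]::new($phrase)

# Escaped, every metacharacter is matched literally
$literal = [regex]::new([regex]::Escape($phrase))

$line = 'Call (555) 123-4567 today'
$raw.IsMatch($line)       # False
$literal.IsMatch($line)   # True
```

    The escaped string can then be passed as the -Pattern argument in the same way as "KETCHUP!".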