PowerShell partial string comparison


I'm currently stuck on a specific comparison problem. I have two CSV files that contain application names, and I need to find the names that appear in both. That would be easy if the application names were written the same way in both files, but they're not.

Each CSV has two columns, but only the first column contains the application names. In csv01 an app is called "Adobe Acrobat Reader DC Continuous MUI", while the same app in csv02 is called "Adobe Acrobat Reader DC v2022.002.20191". By looking at the files, I know both contain "Adobe Acrobat Reader DC". But I'd like to automate the comparison, as the CSVs contain hundreds of apps.

I initially thought I'd run a nested foreach loop, taking the first product in csv01 and comparing every app in csv02 to that product to see if I have a match. I did that by splitting the application names at each space character and came up with the following code:

# Define the first string
$Products01 = Import-CSV 'C:\Temp\ProductsList01.csv' -Delimiter ";"

# Define the second string
$Products02 = Import-CSV 'C:\Temp\ProductList02.csv' -Delimiter ";"

# Flag to track if all parts of string2 are contained within string1
$allPartsMatch = $true

# Create Hashtable for results
$MatchingApps = @{}


# Loop through each part of string2
foreach ($Product in $Products01.Product) {

    Write-Host "==============================="
    Write-Host "Searching for product: $Product"
    Write-Host "==============================="

    # Split the product name into parts
    $ProductSplit = $Product -split " "

    Write-Host "Split $Product into $ProductSplit"

    foreach ($Application in $Products02.Column1) {
    
        Write-Host "Getting comparison app: $Application"

        # Split the product name into parts
        $ApplicationSplit = $Application -split " "

        Write-Host "Split comparison App into: $ApplicationSplit"
        
        # Check if the current part is contained within string1
        if ($ProductSplit -notcontains $ApplicationSplit) {
            # If the current part is not contained within string1, set the flag to false
            $allPartsMatch = $false
        }
    }
    # Display a message indicating the result of the comparison
    if ($allPartsMatch) {
        Write-Host "==============================="
        Write-Host "$Application is contained within $Product"
        Write-Host "==============================="
        
        $MatchingApps += @{Product01 = $Product; Product02 = $Application}
    } else {
        #Write-Host "$Application is not contained within $Product"
    }
}

However, I seem to have a logic error in my thought process as this returns 0 matches. So obviously, the script isn't properly splitting or comparing the split items.

My question is: how do I compare the parts of both app names to see whether an app appears in both CSVs? Can I use a specific regex for that, or do I need to approach the problem differently?

Cheers,

Fred

I tried comparing both csv files for similar product names. I expected a table of similar product names. I received nothing.


Solution

  • The basis for "matching" one name to another here is that they share a prefix, so start by writing a small function that extracts the common prefix of two strings. We'll need it later:

    function Get-CommonPrefix {
      param(
        [string]$A,
        [string]$B
      )
    
      # start by assuming the strings share no common prefix
      $prefixLength = 0
    
      # the shared prefix can be at most as long as the shorter string
      $maxLength = [Math]::Min($A.Length, $B.Length)
    
      for($i = 0; $i -lt $maxLength; $i++){
        if($A[$i] -eq $B[$i]){
          $prefixLength = $i + 1
        }
        else {
          # we've reached an index with two different characters, common prefix stops here 
          break
        }
      }
    
      # return the shared prefix
      return $A.Substring(0, $prefixLength)
    }
    

    Now we can determine the shared prefix between two strings:

    PS ~> $sharedPrefix = Get-CommonPrefix 'Adobe Acrobat Reader DC Continuous MUI' 'Adobe Acrobat Reader DC v2022.002.20191'
    PS ~> Write-Host "The shared prefix is '$sharedPrefix'"
    The shared prefix is 'Adobe Acrobat Reader DC '
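
    Note that the result above ends in a trailing space, and a character-based prefix can also stop in the middle of a word ('Adobe Acrobat' and 'Adobe Acrobatics' would share 'Adobe Acrobat'). If that matters for your data, here is a minimal sketch of a word-based variant. The name Get-CommonWordPrefix is made up for this example, and it assumes that names are separated by whitespace. Because it compares whole words with -eq on strings, which is case-insensitive by default in PowerShell, it also ignores differences in casing:

    ```powershell
    # Hypothetical variant of Get-CommonPrefix that works on whole words
    function Get-CommonWordPrefix {
      param(
        [string]$A,
        [string]$B
      )

      # split on runs of whitespace and drop empty entries
      $wordsA = @($A -split '\s+' | Where-Object { $_ -ne '' })
      $wordsB = @($B -split '\s+' | Where-Object { $_ -ne '' })

      # count how many leading words are equal (string -eq ignores case)
      $count = 0
      $maxCount = [Math]::Min($wordsA.Count, $wordsB.Count)
      while ($count -lt $maxCount -and $wordsA[$count] -eq $wordsB[$count]) {
        $count++
      }

      if ($count -eq 0) {
        # no shared leading word at all
        return ''
      }

      # rebuild the prefix from the first string's words, joined by single spaces
      return ($wordsA[0..($count - 1)] -join ' ')
    }

    Get-CommonWordPrefix 'Adobe Acrobat Reader DC Continuous MUI' 'adobe acrobat reader dc v2022.002.20191'
    # returns 'Adobe Acrobat Reader DC'
    ```

    Whether a character-based or a word-based prefix fits better depends on how the names in your lists differ, so treat this as one option rather than a replacement.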
    

    Now we just need to put Get-CommonPrefix to use in your nested loop:

    # Import the first list
    $Products01 = Import-CSV 'C:\Temp\ProductsList01.csv' -Delimiter ";"
    
    # Import the second list
    $Products02 = Import-CSV 'C:\Temp\ProductList02.csv' -Delimiter ";"
    
    # now let's find the best match from list 2 for each item in list 1:
    foreach($productRow in $Products01) {
      # we'll use this to keep track of shared prefixes encountered
      $matchDetails = [pscustomobject]@{
        Index = -1
        Prefix = ''
        Product2 = $null
      }
    
      for($i = 0; $i -lt $Products02.Count; $i++) {
        # for each pair, start by extracting the common prefix and see if we have a "better match" than previously
        $commonPrefix = Get-CommonPrefix $productRow.Product $Products02[$i].Product
        if($commonPrefix.Length -gt $matchDetails.Prefix.Length){
          # we found a better match!
          $matchDetails.Index = $i
          $matchDetails.Prefix = $commonPrefix
          $matchDetails.Product2 = $Products02[$i]
        }
      }
    
      if($matchDetails.Index -ge 0){
        Write-Host "Best match found for '$($productRow.Product)': '$($matchDetails.Product2.Product)' "
    
        # put code that needs to work on both rows here ...
      }
    }
    

    Note: when several entries in the second list share a prefix of the same length with an entry from the first list, the code simply picks the first one.
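
    One more thing to be aware of: the loop above accepts any shared prefix longer than zero characters, so two unrelated names that merely start with the same letter still count as a "best match". Below is a minimal sketch of one way to guard against that and to collect the results as objects. Find-BestPrefixMatch, the MinPrefixLength parameter, its default of 6 and the sample names are all assumptions made for this example, not something taken from your files:

    ```powershell
    # Same logic as Get-CommonPrefix above, repeated so this sketch runs on its own
    function Get-CommonPrefix {
      param([string]$A, [string]$B)
      $i = 0
      $maxLength = [Math]::Min($A.Length, $B.Length)
      while ($i -lt $maxLength -and $A[$i] -eq $B[$i]) { $i++ }
      return $A.Substring(0, $i)
    }

    # Hypothetical helper: emits the best prefix match for $Name among $Candidates,
    # or nothing when the longest shared prefix is shorter than $MinPrefixLength
    function Find-BestPrefixMatch {
      param(
        [string]$Name,
        [string[]]$Candidates,
        [int]$MinPrefixLength = 6   # assumed threshold, tune it for your data
      )

      $best = $null
      foreach ($candidate in $Candidates) {
        $prefix = Get-CommonPrefix $Name $candidate
        $longEnough = $prefix.Length -ge $MinPrefixLength
        # strictly longer only, so ties still go to the first candidate
        if ($longEnough -and ($null -eq $best -or $prefix.Length -gt $best.Prefix.Length)) {
          $best = [pscustomobject]@{
            Product01 = $Name
            Product02 = $candidate
            Prefix    = $prefix
          }
        }
      }

      # emit only real matches, so callers don't collect $null entries
      if ($null -ne $best) { $best }
    }

    # Sample data standing in for the first column of each CSV
    $list01 = 'Adobe Acrobat Reader DC Continuous MUI', 'Apple Software Update'
    $list02 = 'Adobe Acrobat Reader DC v2022.002.20191', 'Audacity 3.2.1'

    $matchingApps = @(foreach ($name in $list01) {
      Find-BestPrefixMatch -Name $name -Candidates $list02
    })

    $matchingApps | Format-Table -AutoSize
    ```

    With this sample data only the Adobe entry is reported. 'Apple Software Update' shares just the letter 'A' with both candidates, which is below the threshold, so it is skipped instead of being paired with an unrelated app.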