
How to find the unique words in a text file and store them in another text file using PowerShell


I am using PowerShell. I want to remove duplicate words from a text file and then store the unique words in a text file. Here is what I have so far:

$A = $( foreach ($line in Get-Content C:\Test1\File1.txt) {
    $line.tolower().split(" ")
  }) | Sort-Object | Get-Unique
$A | export-csv "somefile.csv"

Here is my file.


Solution

  • PowerShell can use a .NET type called a HashSet, which is perfect for doing exactly this, and at the figurative speed of light too!

    First, we read the file into memory in PowerShell and assign it to a variable called $lines.

    Next, we split the lines into individual $words.

    Finally, we create a HashSet from those words. A HashSet only stores unique items, so duplicates are dropped.

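    # Read every line of the file into memory
    $lines = Get-Content "C:\Users\Stephen\OneDrive\Documents\quotes.txt"
    # Split() with no arguments splits each line on whitespace
    [string[]]$words = $lines.Split()
    # The HashSet keeps only one copy of each distinct word
    $uniqueWords = [System.Collections.Generic.HashSet[string]]::new($words)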
    

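    Here's some info on how this works: we're using the HashSet constructor that accepts a collection as its input. As it loads the items, it silently skips any value it already contains, so only the unique words remain.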
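
    The question also asks to store the unique words in a text file and lowercases the input first. A minimal sketch of one way to do both is shown here. The temporary file paths and the sample text are hypothetical, and the case-insensitive comparer is an assumption about what "unique" should mean; swap in your own paths and drop the comparer if case matters.

    ```powershell
    # Hypothetical file locations, used only for this illustration.
    $tempDir    = [System.IO.Path]::GetTempPath()
    $inputPath  = Join-Path $tempDir 'sample-input.txt'
    $outputPath = Join-Path $tempDir 'unique-words.txt'

    # Create a small sample input so the sketch is self-contained.
    Set-Content -Path $inputPath -Value 'The quick brown fox', 'the lazy dog  and THE fox'

    # Split every line on runs of whitespace and drop any empty entries.
    [string[]]$words = Get-Content -Path $inputPath |
        ForEach-Object { $_ -split '\s+' } |
        Where-Object { $_ -ne '' }

    # Assumption: "The", "the" and "THE" should count as the same word.
    $comparer    = [System.StringComparer]::OrdinalIgnoreCase
    $uniqueWords = [System.Collections.Generic.HashSet[string]]::new($words, $comparer)

    # Write one word per line to a plain text file.
    $uniqueWords | Sort-Object | Set-Content -Path $outputPath
    ```

    Set-Content writes one word per line as plain text, whereas piping strings to Export-Csv writes their Length property rather than the words themselves.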

    But it's FAST!

    Using a HashSet is amazingly fast, too! I measured the performance on a reasonably sized file: 10 MB of text from samplefile.com containing a number of famous quotes and other info.

    Method           TotalMs
    ------           -------
    Get-Unique    21484.4956
    Using Hashset  1840.7407
    

    The HashSet approach is dramatically faster. In this test it was roughly an order of magnitude faster than Get-Unique (about 1.8 seconds versus 21.5 seconds), and I've seen it be two orders of magnitude or more faster before.
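
    To try a comparison like this yourself, something along these lines should work. This is only a sketch and not the exact benchmark behind the numbers above: the word list is generated, hypothetical data, and timings will vary with the machine, the PowerShell version, and the input.

    ```powershell
    # Hypothetical sample data: 20,000 words drawn from a small vocabulary.
    $vocab = 'alpha', 'beta', 'gamma', 'delta', 'epsilon'
    [string[]]$words = foreach ($i in 1..20000) { $vocab[$i % $vocab.Count] }

    # Time the pipeline approach: sort first, then drop adjacent duplicates.
    $pipelineTime = Measure-Command {
        $null = $words | Sort-Object | Get-Unique
    }

    # Time the HashSet approach: load everything through the constructor.
    $hashSetTime = Measure-Command {
        $null = [System.Collections.Generic.HashSet[string]]::new($words)
    }

    # Collect both timings in a small table.
    $results = @(
        [pscustomobject]@{ Method = 'Get-Unique';    TotalMs = $pipelineTime.TotalMilliseconds }
        [pscustomobject]@{ Method = 'Using HashSet'; TotalMs = $hashSetTime.TotalMilliseconds }
    )
    $results | Format-Table -AutoSize
    ```

    Measure-Command returns a TimeSpan, so TotalMilliseconds gives the same unit as the table above.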