I am using PowerShell. I want to remove the duplicate words from a text file and then store the unique words in another file. Here is what I do:
$A = $( foreach ($line in Get-Content C:\Test1\File1.txt) {
    $line.ToLower().Split(" ")
}) | Sort-Object | Get-Unique
$A | Export-Csv "somefile.csv"
Here is my file.
PowerShell can use a .NET type called a HashSet, which is perfect for doing exactly this, and at the figurative speed of light too!
First, we read the file into memory and assign it to a variable called $lines. Next, we split the lines into individual $words. Finally, we create a HashSet, which will only keep the unique words or items.
# Read the file into an array of lines
$lines = Get-Content "C:\Users\Stephen\OneDrive\Documents\quotes.txt"
# Split every line on whitespace into individual words
[string[]]$words = $lines.Split()
# The HashSet keeps only one copy of each distinct word
$uniqueWords = [System.Collections.Generic.HashSet[string]]::new($words)
Here's some info on how this works: we're using the HashSet constructor that accepts an existing collection, and any duplicates are discarded as the items are added.
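Note that a HashSet is case-sensitive by default, so "The" and "the" would both survive. One way to get the case-insensitive behavior the question's ToLower() was after, and to write the result back to a text file as the question wanted, is a sketch like this (the output path is just an example):

# The IEqualityComparer overload makes the set ignore case;
# the first casing seen is the one that is kept
$uniqueWords = [System.Collections.Generic.HashSet[string]]::new(
    [string[]]$words,
    [System.StringComparer]::OrdinalIgnoreCase
)

# Write the unique words out, one per line
$uniqueWords | Sort-Object | Set-Content "C:\Test1\UniqueWords.txt"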
But it's FAST!
It is amazingly fast to use a hashset, too! I measured the performance on a reasonably sized 10 MB text file from samplefile.com containing a number of famous quotes and other info.
Method TotalMs
------ -------
Get-Unique 21484.4956
Using Hashset 1840.7407
The hashset is dramatically faster. It's an order of magnitude faster even in the worst case, and I've seen it be two orders of magnitude or more before.
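For reference, here is a minimal sketch of how such a timing comparison could be run with Measure-Command (the file path is just an example):

$lines = Get-Content "C:\Temp\sample-10mb.txt"
[string[]]$words = $lines.Split()

# Time the Sort-Object | Get-Unique pipeline from the question
$slow = Measure-Command {
    $words | Sort-Object | Get-Unique | Out-Null
}

# Time the hashset constructor
$fast = Measure-Command {
    $null = [System.Collections.Generic.HashSet[string]]::new($words)
}

[pscustomobject]@{ Method = 'Get-Unique';    TotalMs = $slow.TotalMilliseconds }
[pscustomobject]@{ Method = 'Using Hashset'; TotalMs = $fast.TotalMilliseconds }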