PowerShell: remove duplicate lines across multiple text files in a folder


I need to remove duplicate lines, in place if possible, from multiple text files in a folder, using PowerShell.

I've found a way to list the duplicated lines:

Get-Content "$path\*.*" | Group-Object | Where-Object { $_.Count -gt 1 } | Select -ExpandProperty Name

I think a foreach loop would be useful now, but I don't know how to do the removal in place...

Can someone help me, please?

EDIT: I've changed the title of the question to avoid misunderstanding!

EDIT 2 (based on Olaf's hint):

PS C:\Users\Robbi> $mypath = "F:\DATA\Urls_CP"
PS C:\Users\Robbi> Get-ChildItem -Path $mypath -Filter * |
>>     ForEach-Object{
>>         $Content =
>>         Get-Content -Path $_.FullName | Sort-Object -Unique
>>         $Content | Out-File -FilePath $_.FullName
>>     }

PS C:\Users\Robbi> Get-Content $mypath\* | Select-String "https://httpd.apache.org/docs/2.4/mod/mod_md.html"

https://httpd.apache.org/docs/2.4/mod/mod_md.html
https://httpd.apache.org/docs/2.4/mod/mod_md.html

But something has changed: I copied the original folder, "Urls", and ran your code on the copy, "Urls_CP". "Urls_CP" is now about 200 KB bigger than the original "Urls"!

Just for info, each file is a Squid proxy "access.log" from a Linux VM, processed with PowerShell. I've checked the encoding and looked for "strange" characters with Notepad++. (I don't have access to a Linux shell.)

This is an extract from one of the files in the "Urls" folder:
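
A possible explanation for the size increase, offered as an assumption rather than a diagnosis: in Windows PowerShell 5.1, Out-File writes UTF-16LE ("Unicode") unless -Encoding is given, which stores two bytes per ASCII character, whereas PowerShell 6 and later default to UTF-8 without a BOM. A minimal sketch of the same per-file de-duplication with an explicit encoding follows. The function name Set-UniqueLines and its parameters are hypothetical, and 'utf8' is only a guess at what suits these files.

```powershell
# Hypothetical helper: rewrite one file with its unique lines, using an explicit encoding.
function Set-UniqueLines {
    param(
        [Parameter(Mandatory)][string]$Path,
        # Assumption: UTF-8 suits the data. Windows PowerShell 5.1 adds a BOM for 'utf8'.
        [string]$Encoding = 'utf8'
    )
    # Sort-Object consumes all input before emitting anything, so the file is fully
    # read (and closed) before Out-File reopens it for writing.
    $Unique = @(Get-Content -LiteralPath $Path | Sort-Object -Unique)
    $Unique | Out-File -LiteralPath $Path -Encoding $Encoding
}
```

It could be called per file, for example with Get-ChildItem -Path $mypath -File | ForEach-Object { Set-UniqueLines -Path $_.FullName }. Note that Sort-Object -Unique reorders the lines and compares them case-insensitively unless -CaseSensitive is added.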

https://community.checkpoint.com/t5/API-CLI-Discussion-and-Samples/can-anybody-let-me-know-how-can-we-import-policy-rules-via-csv/td-p/20839
https://community.checkpoint.com/t5/API-CLI-Discussion-and-Samples/Python-tool-for-exporting-importing-a-policy-package-or-parts-of/td-p/41100
https://community.checkpoint.com/t5/General-Management-Topics/R80-10-API-bug-fallback-to-quot-SmartCenter-Only-quot-after/m-p/5074
https://github.com/CheckPointSW/cp_mgmt_api_python_sdk
https://github.com/CheckPointSW/cpAnsible/issues/2
https://github.com/CheckPointSW/ExportImportPolicyPackage/issues
https://stackoverflow.com/questions/15031694/installing-python-packages-from-local-file-system-folder-to-virtualenv-with-pip
https://stackoverflow.com/questions/24627525/fatal-error-in-launcher-unable-to-create-process-using-c-program-files-x86
https://stackoverflow.com/questions/25749621/whats-the-difference-between-pip-install-and-python-m-pip-install
https://stackoverflow.com/questions/42494229/how-to-pip-install-a-local-python-package

EDIT 3:

Please forgive me, I'll try to explain myself better!

I want to keep the structure of the "Urls" folder, which contains multiple files. I want to remove the duplicates (or replace them with $null) "on an all-files basis" while preserving each file in the folder, i.e. not one big file with all the HTTP addresses inside! In EDIT 2 I showed Olaf that the string "https://httpd.apache.org/docs/2.4/mod/mod_md.html" is still duplicated, because it is present in both "$mypath\file1.txt" and "$mypath\file512.txt". I now understand that Olaf's code checks for duplicates "on a per-file basis" (thanks to @Lee_Dailey, I see what was unclear in my question!).
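
The "all-files basis" described in this edit can be sketched with a single set of already-seen lines shared across every file, so that only the first occurrence of a line survives while each file keeps its name and line order. This is a sketch under assumptions, not the method from the answer: the function name Remove-CrossFileDuplicate is hypothetical, files are processed in name order, comparison is case-sensitive, and Set-Content writes with the session's default encoding.

```powershell
# Hypothetical helper: keep only the first occurrence of each line across all files.
function Remove-CrossFileDuplicate {
    param(
        [Parameter(Mandatory)][string]$Folder,
        [string]$Filter = '*.txt'
    )
    # Every line seen so far, across all files (ordinal, case-sensitive comparison).
    $Seen = [System.Collections.Generic.HashSet[string]]::new([System.StringComparer]::Ordinal)

    Get-ChildItem -LiteralPath $Folder -Filter $Filter -File |
        Sort-Object -Property Name |
        ForEach-Object {
            $File = $_
            # Read the whole file first so it is closed before being rewritten.
            $Lines = @(Get-Content -LiteralPath $File.FullName)
            # HashSet.Add returns $false when the line was already seen.
            $Kept = @($Lines | Where-Object { $Seen.Add($_) })
            Set-Content -LiteralPath $File.FullName -Value $Kept
        }
}
```

For example, Remove-CrossFileDuplicate -Folder 'F:\DATA\Urls_CP' rewrites the files in place, so it is worth trying on a copy first. A file whose lines all appeared in earlier files ends up empty rather than deleted.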

EDIT 4:

$SourcePath = 'F:\DATA\Urls_CP'
$TargetPath = 'F:\DATA\Urls_CP\DeDupe'

$UrlList = Get-ChildItem -Path $SourcePath -Filter *.txt |
    ForEach-Object {
        $FileName = $_.BaseName
        $FileLWT = (Get-ItemProperty $_.FullName).LastWriteTime
        Get-Content -Path $_.FullName -Encoding default |
            ForEach-Object {
                [PSCustomObject]@{
                    URL = $_
                    File = $FileName
                    LWT = $FileLWT
                }
            }
    }

$UrlList | 
    Sort-Object -Property URL -Unique |
        ForEach-Object {
            $TargetFile = Join-Path -Path $TargetPath -ChildPath ($_.File + '.txt')
            $_.URL | Out-File -FilePath $TargetFile -Append -Encoding default
            Set-ItemProperty $TargetFile -Name LastWriteTime -Value $_.LWT
        }

Solution

  • Your explanation in Edit #3 makes even less sense, I think. What is this task actually for?

    $SourcePath = 'F:\DATA\Urls_CP'
    $TargetPath = 'F:\DATA\Urls_CP\DeDupe'
    
    # Pair every line with the base name of the file it came from.
    $UrlList = Get-ChildItem -Path $SourcePath -Filter *.log |
        ForEach-Object {
            $FileName = $_.BaseName
            Get-Content -Path $_.FullName -Encoding default |
                ForEach-Object {
                    [PSCustomObject]@{
                        URL = $_
                        File = $FileName
                    }
                }
        }
    
    # Keep one entry per URL, then append it to a file named after its source file.
    $UrlList |
        Sort-Object -Property URL -Unique |
            ForEach-Object {
                $TargetFile = Join-Path -Path $TargetPath -ChildPath ($_.File + '.log')
                $_.URL | Out-File -FilePath $TargetFile -Append -Encoding default
            }
    

    The target folder has to exist in advance.
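
    Because the target folder has to exist and the script writes with Out-File -Append, a second run would add to the output of the first. A small preparatory step, sketched here with a hypothetical helper name (Reset-TargetFolder), creates the folder if it is missing and clears earlier results. The '*.log' filter is an assumption taken from the script above.

```powershell
# Hypothetical helper: make sure the target folder exists and holds no earlier output.
function Reset-TargetFolder {
    param(
        [Parameter(Mandatory)][string]$Path,
        [string]$Filter = '*.log'
    )
    # -Force makes New-Item succeed quietly when the directory already exists.
    New-Item -ItemType Directory -Path $Path -Force | Out-Null
    # Remove files left over from a previous run so -Append starts from empty files.
    Get-ChildItem -LiteralPath $Path -Filter $Filter -File | Remove-Item
}
```

    Calling Reset-TargetFolder -Path $TargetPath before the pipeline would be one way to use it. It deletes matching files in that folder, so the path should be checked first.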