regex, windows, powershell, command-line, findstr

Regexp in findstr to find URLs in txt files in all subfolders


I am struggling with what should be a pretty simple CMD task.

I have a root folder (C:\folder) containing many subfolders, each of which holds different kinds of files. I want to search every txt file in every subfolder for URLs and, at the end, write all of the links to a single file. My regexp for matching a URL is:

(https?|ftp|file):\/\/\)?[-A-Za-z0-9+&@#\/%?=~_|!:,.;]+[-A-Za-z0-9+&@#\/%=~_|]

and it works.

My latest attempt was:

for /R C:\folder %%F in (*.txt) do (
   findstr /r "(https?|ftp|file):\/\/\)?[-A-Za-z0-9+&@#\/%?=~_|!:,.;]+[-A-Za-z0-9+&@#\/%=~_|]"  >> results.txt
)

Can you help me? What am I missing?


Solution

  • I am not sure that this regex matches every possible URL, but here is how to use it in a PowerShell command:

    Get-ChildItem -Recurse -File -Filter '*.txt' |
       Select-String -Pattern '(https?|ftp|file):\/\/\)?[-A-Za-z0-9+&@#\/%?=~_|!:,.;]+[-A-Za-z0-9+&@#\/%=~_|]'
    

    As suggested by @mklement0, to output only the matched text instead of the full match objects:

    Get-ChildItem -Recurse -File -Filter '*.txt' |
        Select-String -Pattern '(https?|ftp|file):\/\/\)?[-A-Za-z0-9+&@#\/%?=~_|!:,.;]+[-A-Za-z0-9+&@#\/%=~_|]' |
        ForEach-Object { $_.Matches.Value }
    

    and, to redirect the matches to a file:

    Get-ChildItem -Recurse -File -Filter '*.txt' |
        Select-String -Pattern '(https?|ftp|file):\/\/\)?[-A-Za-z0-9+&@#\/%?=~_|!:,.;]+[-A-Za-z0-9+&@#\/%=~_|]' |
        ForEach-Object { $_.Matches.Value } >results.txt
    

    I would not put the results.txt file in the same directory, since it would then be searched as well the next time the command is run. Consider placing it in your home directory instead:

    Get-ChildItem -Recurse -File -Filter '*.txt' |
        Select-String -Pattern '(https?|ftp|file):\/\/\)?[-A-Za-z0-9+&@#\/%?=~_|!:,.;]+[-A-Za-z0-9+&@#\/%=~_|]' |
        ForEach-Object { $_.Matches.Value } |
        Out-File -FilePath '~/results.txt'
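
Note that `Select-String` reports only the first match on each line unless `-AllMatches` is given, so lines containing several URLs would lose all but the first. As a minimal sketch, assuming the pattern from the question is the one you want, the helper below collects every match and removes duplicates. The function name `Get-UrlFromTextFile` and its `-Root` parameter are hypothetical and not part of PowerShell.

```powershell
# Hypothetical helper: the function name and -Root parameter are illustrative only.
function Get-UrlFromTextFile {
    param(
        [Parameter(Mandatory)]
        [string] $Root    # folder to search, e.g. C:\folder from the question
    )

    # Pattern copied unchanged from the question.
    $pattern = '(https?|ftp|file):\/\/\)?[-A-Za-z0-9+&@#\/%?=~_|!:,.;]+[-A-Za-z0-9+&@#\/%=~_|]'

    Get-ChildItem -LiteralPath $Root -Recurse -File -Filter '*.txt' |
        Select-String -Pattern $pattern -AllMatches |   # every URL on a line, not just the first
        ForEach-Object { $_.Matches.Value } |           # emit the matched text only
        Sort-Object -Unique -CaseSensitive              # drop duplicates; URL paths can be case-sensitive
}

# Example call (assumed paths): write the list outside the searched tree.
# Get-UrlFromTextFile -Root 'C:\folder' | Set-Content -Path (Join-Path $HOME 'results.txt')
```

Writing to `$HOME` keeps the output file out of the searched tree, so repeated runs do not pick up their own results.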