Findstr: Search list of strings in folder of txt files


I'm trying to use FINDSTR to search through a folder full of text files, using a text file of strings, then output to results.txt

The text file of strings contains 3,200 lines, each containing an author's name and the associated book title. Examples:

George Orwell 1984
H. G. Wells War of the Worlds
Isaac Asimov I, Robot

I also have a folder containing a dozen text lists of ebook filenames (some of the lists have over 500K lines), for example:

George Orwell - 1984 (epub).rar
H G Wells - War of the Worlds (pdf).rar
Isaac Asimov - [Robot 0.1] - I, Robot (Mobi).rar

I need to search the text files of filenames for the 3,200 authors and titles, and output the results to a third text list.

The filenames also contain other stuff like series info, format, etc., so I'm looking for any lines that contain those author names and titles but are not exact matches to the search strings, as in my examples above.

This is what I've tried. It matches exact strings OK but I cannot see how to make it find the filenames that contain other stuff as well as all the words in the search strings.

findstr /g:C:\strings.txt *.txt >>C:\results.txt

Can anyone please help me out with the code? Thanks.


Solution

  • This find-in-files task requires a regular expression search because the strings in strings.txt do not exist 1:1 in the *.txt files.

    It is necessary to change the strings in strings.txt from

    George Orwell 1984
    H. G. Wells War of the Worlds
    Isaac Asimov I, Robot
    

    to

    George.*Orwell.*1984
    H.*G.*Wells.*War.*of.*the.*Worlds
    Isaac.*Asimov.*I.*Robot
    

    This can be done by opening strings.txt in a text editor with Perl regular expression support and running, from the top of the file, a Perl regular expression replace all with the search string [^\w\r\n]+ and the replace string .*. The search expression matches one or more characters that are not a word character, a carriage return or a line-feed, so every run of spaces and punctuation between words becomes .* (any characters, zero or more times).

    Then it is possible to use:

    findstr /I /R /G:C:\Temp\strings.txt *.txt >>C:\Temp\results.txt
    

    strings.txt and results.txt should not be in the current directory containing the *.txt files searched by FINDSTR, or these two files should use a file extension other than .txt. Otherwise the wildcard *.txt would include them in the search as well.
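
If no text editor with Perl regular expression support is at hand, the replace step described above could also be scripted. The following is only a minimal sketch in Python, not part of the original answer: the function names `to_findstr_pattern` and `convert_file` are made up for illustration, and the UTF-8 file encoding is an assumption that may need to be changed to match how strings.txt was saved.

```python
import re

# Same search expression as in the answer: one or more characters that are
# not a word character, a carriage return or a line-feed.
SEPARATORS = re.compile(r"[^\w\r\n]+")


def to_findstr_pattern(line: str) -> str:
    """Replace each run of spaces/punctuation in one line with '.*'."""
    return SEPARATORS.sub(".*", line.strip())


def convert_file(src: str, dst: str) -> None:
    """Write one converted pattern per line of src to dst.

    Hypothetical helper; the encoding is an assumption.
    Lines without any word character are skipped, because they would
    turn into a pattern that matches every line.
    """
    with open(src, encoding="utf-8") as fin, \
            open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            if re.search(r"\w", line):
                fout.write(to_findstr_pattern(line) + "\n")


if __name__ == "__main__":
    for example in (
        "George Orwell 1984",
        "H. G. Wells War of the Worlds",
        "Isaac Asimov I, Robot",
    ):
        print(to_findstr_pattern(example))
```

Run as a script, this prints the three converted example lines shown earlier. Calling `convert_file` with the path of the original list and a new output path would produce the file to pass to FINDSTR with /G. Note that Python's `\w` also covers non-ASCII letters, which may differ slightly from what a particular text editor does.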