Tags: regex, cmd, findstr

Regular Expression with a Space for use in findstr command


I need help creating the Regular Expression necessary to use with the findstr command, to find all the lines in my (plain text) input files matching the following conditions:

  1. The line begins with an asterisk (*), followed by one of the letters P, C, or T, followed by a number containing between 3 and 5 digits, followed by a space (followed by anything)
  2. The line begins with an asterisk (*) followed by between 2 and 10 spaces (followed by anything)

I can get close to my desired results using multiple literal strings in an input file identified with the /G option, but I have been unable to figure out how to include the "followed by a space" part of my requirement. I can also get close using the /R option to specify regular expressions, which I think will let me include the "followed by a space" requirement.

But I need multiple expressions, so I wanted to provide them on separate lines in an input file, and specifying multiple regular expressions in an input file with the /G option seems to cause problems. When I instead specify multiple regular expressions (including a space character) as command-line arguments, the findstr command seems to hang. So I am stuck, and I am hoping someone with more experience can help (I have very limited experience building regular expressions).

So ... the following command (using only command line arguments) gets me pretty close, but it is not perfect. Can someone explain how to modify this so that it works as needed (using only command line arguments or using an input file)?

findstr /R /B "\*[PCT][0-9] " SampleInput.txt

The lines below are the contents of the SampleInput.txt file. When I run the findstr command against this file I need to see as output only the 9 lines expected. Thanks in advance for any help / ideas / suggestions!

SampleInput.txt:
This line should not match - no asterisk
 * This line should not match because the first character is not an asterisk
*P123 This line should match because it starts with P123 (followed by a space)
*P123This line should not match because there is no space after the P123
*C1234 This line should match ... just like P123 above
*C1234This line should not match ... just like P123This above
*T12345 This line should match ... just like P123 above
*T12345This line should not match ... just like P123This above
*This line should not match because the string *T12345 is not at the beginning of the line
*TX This line should not match because the T is not followed by a number
*CX This line should not match because the C is not followed by a number
*PX This line should not match because the P is not followed by a number
* This line should NOT match because the Asterisk is followed by only one Space
*  This line should match because it starts with Asterisk Space Space
*   This line should match because it starts with Asterisk Space Space Space
*    This line should match because it starts with Asterisk Space Space Space Space
*     This line should match because it starts with Asterisk Space Space Space Space Space
*      This line should match because it starts with Asterisk Space Space Space Space Space Space
*C456... This line should not match because of the three dots (ie there is no space after the C456)
*T999 Bottom line ... 9 of the lines in this file should match (including this line)
 

Solution

  • I added some lines to the test file for completeness.

    findstr's regular expression support is very limited (crippled). You have to work around those restrictions, e.g. search for three, four, and five digits explicitly (as there is no {3,5} quantifier).

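    ^ anchors the match at the beginning of the line. The * has to be escaped (with a \) because it otherwise has a special meaning (zero or more occurrences of the previous character). You also have to pass each search string with /c: to take care of the space(s); without it, findstr treats a space as a separator between multiple search strings.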
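
As a rough illustration of the workaround described above, the following Python sketch expands a bounded repetition (3 to 5 digits) into the explicit alternatives that findstr needs and prints the resulting command line. It only builds text and does not run findstr. The helper names `expand_repeat` and `build_findstr_args` and the `digit_class` parameter are made up for this example; t.txt is the test file name used in this answer.

```python
def expand_repeat(atom: str, low: int, high: int) -> list[str]:
    """Return one explicit pattern per repetition count from low to high."""
    return [atom * n for n in range(low, high + 1)]


def build_findstr_args(digit_class: str = "[0-9]") -> list[str]:
    """Build the findstr arguments for the two conditions in the question."""
    # ^ anchors at line start, \* is a literal asterisk, and the trailing
    # space is the "followed by a space" requirement.
    patterns = [rf"^\*[PCT]{digits} " for digits in expand_repeat(digit_class, 3, 5)]
    # Asterisk followed by at least two spaces.
    patterns.append(r"^\*  ")
    # /r selects regex mode; each /c:"..." keeps embedded spaces literal.
    return ["/r"] + [f'/c:"{p}"' for p in patterns]


if __name__ == "__main__":
    print("findstr " + " ".join(build_findstr_args()) + " t.txt")
```

Calling `build_findstr_args("[0123456789]")` would produce the same command with an explicit digit list instead of the `[0-9]` range.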

    Extended test file:

    d:\temp>type t.txt
    *P1 this line should NOT match - less than three numbers
    *C12 this line should NOT match - less than three numbers
    *T123456 this line should NOT match - more than five numbers
    SampleInput.txt:
    This line should not match - no asterisk
     * This line should not match because the first character is not an asterisk
    *P123 This line should match because it starts with P123 (followed by a space)
    *P123This line should not match because there is no space after the P123
    *C1234 This line should match ... just like P123 above
    *C1234This line should not match ... just like P123This above
    *T12345 This line should match ... just like P123 above
    *T12345This line should not match ... just like P123This above
    *This line should not match because the string *T12345 is not at the beginning of the line
    *TX This line should not match because the T is not followed by a number
    *CX This line should not match because the C is not followed by a number
    *PX This line should not match because the P is not followed by a number
    * This line should NOT match because the Asterisk is followed by only one Space
    *  This line should match because it starts with Asterisk Space Space
    *   This line should match because it starts with Asterisk Space Space Space
    *    This line should match because it starts with Asterisk Space Space Space Space
    *     This line should match because it starts with Asterisk Space Space Space Space Space
    *      This line should match because it starts with Asterisk Space Space Space Space Space Space
    *C456... This line should not match because of the three dots (ie there is no space after the C456)
    *T999 Bottom line ... 9 of the lines in this file should match (including this line)
    

    Matching lines:

    d:\temp>type t.txt |  findstr /rc:"^\*[PCT][0-9][0-9][0-9] " /c:"^\*[PCT][0-9][0-9][0-9][0-9] " /c:"^\*[PCT][0-9][0-9][0-9][0-9][0-9] "  /c:"^\*  "
    *P123 This line should match because it starts with P123 (followed by a space)
    *C1234 This line should match ... just like P123 above
    *T12345 This line should match ... just like P123 above
    *  This line should match because it starts with Asterisk Space Space
    *   This line should match because it starts with Asterisk Space Space Space
    *    This line should match because it starts with Asterisk Space Space Space Space
    *     This line should match because it starts with Asterisk Space Space Space Space Space
    *      This line should match because it starts with Asterisk Space Space Space Space Space Space
    *T999 Bottom line ... 9 of the lines in this file should match (including this line)
    

    Non-matching lines (for the sake of completeness):

    d:\temp>type t.txt |  findstr /rvc:"^\*[PCT][0-9][0-9][0-9] " /c:"^\*[PCT][0-9][0-9][0-9][0-9] " /c:"^\*[PCT][0-9][0-9][0-9][0-9][0-9] "  /c:"^\*  "
    *P1 this line should NOT match - less than three numbers
    *C12 this line should NOT match - less than three numbers
    *T123456 this line should NOT match - more than five numbers
    SampleInput.txt:
    This line should not match - no asterisk
     * This line should not match because the first character is not an asterisk
    *P123This line should not match because there is no space after the P123
    *C1234This line should not match ... just like P123This above
    *T12345This line should not match ... just like P123This above
    *This line should not match because the string *T12345 is not at the beginning of the line
    *TX This line should not match because the T is not followed by a number
    *CX This line should not match because the C is not followed by a number
    *PX This line should not match because the P is not followed by a number
    * This line should NOT match because the Asterisk is followed by only one Space
    *C456... This line should not match because of the three dots (ie there is no space after the C456)
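
For readers who want to sanity-check the expected split without findstr, here is a minimal Python sketch of the same rule. It is an independent cross-check, not a reproduction of findstr's behavior: Python's `re` supports `{3,5}` directly, and its `[0-9]` matches only the ASCII digits. The names `RULE`, `SHOULD_MATCH`, `SHOULD_NOT_MATCH`, and `matches` are made up for this example, and the test lines are the line beginnings from the extended test file with the trailing text shortened.

```python
import re

# Python equivalent of the four findstr patterns; {3,5} and {2} are available here.
RULE = re.compile(r"^\*(?:[PCT][0-9]{3,5} | {2})")

# Line beginnings taken from the extended test file (trailing text shortened).
SHOULD_MATCH = ["*P123 This", "*C1234 This", "*T12345 This", "*T999 Bottom"]
SHOULD_MATCH += ["*" + " " * n + "This" for n in range(2, 7)]  # 2 to 6 spaces

SHOULD_NOT_MATCH = [
    "*P1 this", "*C12 this", "*T123456 this", "SampleInput.txt:",
    "This line", " * This", "*P123This", "*C1234This", "*T12345This",
    "*This line", "*TX This", "*CX This", "*PX This", "* This", "*C456... This",
]


def matches(line: str) -> bool:
    """True if the line satisfies either condition from the question."""
    return RULE.match(line) is not None


if __name__ == "__main__":
    assert all(matches(line) for line in SHOULD_MATCH)
    assert not any(matches(line) for line in SHOULD_NOT_MATCH)
    print(len(SHOULD_MATCH), "matching,", len(SHOULD_NOT_MATCH), "non-matching")
```

The counts agree with the two findstr listings above: 9 matching lines and 15 non-matching lines.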
    

    (Note: I used the lazy [0-9], which can give false positives in findstr because the range also matches characters such as ² and ³. Best practice is to use [0123456789] instead.)