Problem: hundreds of thousands of files in hundreds of directories must be tested against a number of PCRE regexes to count and categorize the files and to determine which of the regexes are more viable and inclusive.
My approach for a single regexp test:
find unsorted_test/. -type f -print0 |
xargs -0 grep -Pazo '(?P<message>User activity exceeds.*?\:\s+(?P<user>.*?))\s' |
tr -d '\000' |
fgrep -a unsorted_test |
sed 's/^.*unsorted/unsorted/' |
cut -d: -f1 > matched_files_unsorted_test000.txt ;
wc -l matched_files_unsorted_test000.txt
- `find | xargs` allows sidestepping the "too many arguments" error for grep.
- `grep -Pazo` is the one doing the heavy lifting: `-P` is for PCRE regexes, `-a` is to make sure files are read as text, and `-z -o` are there simply because it doesn't work otherwise with the filebase I have (a sketch of the raw output follows this list).
- `tr -d '\000'` is to make sure the output is not binary.
- `fgrep -a unsorted_test` is to get only the lines with a filename.
- `sed` is to counteract grep's awesome habit of appending trailing lines to each other (basically it removes everything in a line before the filepath).
- `cut -d: -f1` cuts out just the filepath.
- `wc -l` counts the size of the matched file list.
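To make the cleanup stages concrete, here is a minimal sketch of what the raw `grep -Pazo` output looks like before `tr`/`fgrep`/`sed`/`cut` run; the directory, file name, and file contents are invented for illustration:

mkdir -p unsorted_test/demo
printf 'User activity exceeds limit: alice \n' > unsorted_test/demo/sample
find unsorted_test/. -type f -print0 |
    xargs -0 grep -Pazo '(?P<message>User activity exceeds.*?\:\s+(?P<user>.*?))\s'
# prints something like the following, NUL-terminated because of -z
# (which is why tr -d '\000' comes next in the pipeline):
# unsorted_test/./demo/sample:User activity exceeds limit: alice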
The result is a file with 10k+ lines like this: unsorted/./2020.03.02/68091ec4-cf04-4843-a4b2-95420756cd53
which is what I want in the end.
Obviously this is not very good, but it works fine for something made out of sticks and dirt. My main objective here is to test concepts and regexes, not to count for further scaling or anything, really.
So, since `grep -P` does not support the `-f` parameter, I tried using a `while read` loop:
(while read regexline ; do
    echo "$regexline" ;
    find unsorted_test/. -type f -print0 |
        xargs -0 grep -Pazo "$regexline" |
        tr -d '\000' |
        fgrep -a unsorted_test |
        sed 's/^.*unsorted/unsorted/' |
        cut -d: -f1 > matched_files_unsorted_test000.txt ;
    wc -l matched_files_unsorted_test000.txt |
        sed 's/^ *//' ;
done) < regex_1.txt
And as you can imagine, it fails spectacularly: zero matches for everything.
I've experimented with the quote marks in the grep, with the loop type, etc. Nothing.
Any help with the current code, or suggestions on how to do this differently, is very much appreciated. Thank you.
P.S. Yes, I've tried pcregrep, but it returns zero matches even on a single pattern. Dunno why.
You could do this, which will be impossibly slow:
find unsorted_test/. -type f -print0 |
while IFS= read -d '' -r file; do
    while IFS= read -r regexline; do
        grep -Pazo "$regexline" "$file"
    done < regex_1.txt
done |
tr -d '\000' | fgrep -a unsorted_test... blablabla
Or for each line:
find unsorted_test/. -type f -print0 |
while IFS= read -d '' -r file; do
    while IFS= read -r line; do
        while IFS= read -r regexline; do
            if grep -Pazo "$regexline" <<<"$line"; then
                break
            fi
        done < regex_1.txt
    done < "$file"
done |
tr -d '\000' | fgrep -a unsorted_test... blablabla
Or maybe with xargs.
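Another route, since the one-pattern limitation is specific to grep's -P mode: pcregrep does support reading a pattern file with `-f`. The question's P.S. says pcregrep returned zero matches, so treat this as an untested sketch rather than a confirmed fix:

find unsorted_test/. -type f -print0 |
    xargs -0 pcregrep -l -f regex_1.txt
# -f regex_1.txt reads one pattern per line;
# -l prints only the names of matching files, so no sed/cut cleanup is needed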
But I believe the best approach is to just join the regular expressions from the file with `|`:
find unsorted_test/. -type f -print0 |
{
    regex=$(< regex_1.txt paste -sd '|')
    # or maybe with parentheses around each pattern:
    # regex=$(< regex_1.txt sed 's/.*/(&)/' | paste -sd '|')
    xargs -0 grep -Pazo "$regex"
} |
....
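For illustration, here is what the two joins produce on a hypothetical regex_1.txt containing three made-up patterns:

$ printf 'foo.*bar\nbaz\nqux[0-9]+\n' > regex_1.txt
$ < regex_1.txt paste -sd '|'
foo.*bar|baz|qux[0-9]+
$ < regex_1.txt sed 's/.*/(&)/' | paste -sd '|'
(foo.*bar)|(baz)|(qux[0-9]+)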
Notes:
- When reading lines in the shell, use `IFS= read -r line`.
- The `-d ''` option to `read` is bash syntax.
- Use `grep -F` instead of the deprecated `fgrep`.
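A quick demonstration of why `IFS= read -r line` matters: a bare `read` strips leading whitespace and interprets backslashes, while the `IFS= read -r` form preserves the line exactly (sample input invented for the example):

$ printf '  \\tsome text\n' | { read line; echo "[$line]"; }
[tsome text]
$ printf '  \\tsome text\n' | { IFS= read -r line; echo "[$line]"; }
[  \tsome text]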