Problem: hundreds of thousands of files in hundreds of directories must be tested against a number of PCRE regexes to count and categorize the files and to determine which of the regexes are more viable and inclusive.
My approach for a single regexp test:
find unsorted_test/. -type f -print0 |
xargs -0 grep -Pazo '(?P<message>User activity exceeds.*?\:\s+(?P<user>.*?))\s' |
tr -d '\000' |
fgrep -a unsorted_test |
sed 's/^.*unsorted/unsorted/' |
cut -d: -f1 > matched_files_unsorted_test000.txt ;
wc -l matched_files_unsorted_test000.txt
- `find | xargs` allows sidestepping the "too many arguments" error for grep.
- `grep -Pazo` is the one doing the heavy lifting: `-P` is for PCRE regexes, `-a` is to make sure files are read as text, and `-z -o` are there simply because it doesn't work otherwise with the filebase I have (a sketch of the raw output follows this list).
- `tr -d '\000'` is to make sure the output is not binary.
- `fgrep -a unsorted_test` is to get only the lines with a filename.
- `sed` is to counteract grep's awesome habit of appending trailing lines to each other (basically it removes everything in a line before the filepath).
- `cut -d: -f1` cuts out just the filepath.
- `wc -l` counts the size of the matched file list.
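To make the cleanup stages concrete, here is a minimal sketch of what the raw `grep -Pazo` output looks like before `tr`/`fgrep`/`sed`/`cut` run; the directory, file name, and file contents are invented for illustration:

mkdir -p unsorted_test/demo
printf 'User activity exceeds limit: alice \n' > unsorted_test/demo/sample
find unsorted_test/. -type f -print0 |
    xargs -0 grep -Pazo '(?P<message>User activity exceeds.*?\:\s+(?P<user>.*?))\s'
# prints something like the following, NUL-terminated because of -z
# (which is why tr -d '\000' comes next in the pipeline):
# unsorted_test/./demo/sample:User activity exceeds limit: alice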
The result is a file with 10k+ lines like this: unsorted/./2020.03.02/68091ec4-cf04-4843-a4b2-95420756cd53
which is what I want in the end.
Obviously this is not very good, but it works fine for something made out of sticks and dirt. My main objective here is to test concepts and regexes, not to count for further scaling or anything, really.
So, since `grep -P` does not support the `-f` parameter, I tried using a `while read` loop:
(while read regexline ; do
    echo "$regexline" ;
    find unsorted_test/. -type f -print0 |
        xargs -0 grep -Pazo "$regexline" |
        tr -d '\000' |
        fgrep -a unsorted_test |
        sed 's/^.*unsorted/unsorted/' |
        cut -d: -f1 > matched_files_unsorted_test000.txt ;
    wc -l matched_files_unsorted_test000.txt |
        sed 's/^ *//' ;
done) < regex_1.txt
And as you can imagine, it fails spectacularly: zero matches for everything.
I've experimented with the quote marks in the grep, with the loop type, etc. Nothing.
Any help with the current code, or suggestions on how to do this differently, is very much appreciated. Thank you.
P.S. Yes, I've tried pcregrep, but it returns zero matches even on a single pattern. Dunno why.
You could do this, which will be impossibly slow:
find unsorted_test/. -type f -print0 |
while IFS= read -d '' -r file; do
    while IFS= read -r regexline; do
        grep -Pazo "$regexline" "$file"
    done < regex_1.txt
done |
tr -d '\000' | fgrep -a unsorted_test... blablabla
Or for each line:
find unsorted_test/. -type f -print0 |
while IFS= read -d '' -r file; do
    while IFS= read -r line; do
        while IFS= read -r regexline; do
            if grep -Pazo "$regexline" <<<"$line"; then
                break
            fi
        done < regex_1.txt
    done < "$file"
done |
tr -d '\000' | fgrep -a unsorted_test... blablabla
Or maybe with xargs.
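Another route, since the one-pattern limitation is specific to grep's -P mode: pcregrep does support reading a pattern file with `-f`. The question's P.S. says pcregrep returned zero matches, so treat this as an untested sketch rather than a confirmed fix:

find unsorted_test/. -type f -print0 |
    xargs -0 pcregrep -l -f regex_1.txt
# -f regex_1.txt reads one pattern per line;
# -l prints only the names of matching files, so no sed/cut cleanup is needed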
But I believe the best approach is to just join the regular expressions from the file with `|`:
find unsorted_test/. -type f -print0 |
{
    regex=$(< regex_1.txt paste -sd '|')
    # or maybe with parentheses around each pattern:
    # regex=$(< regex_1.txt sed 's/.*/(&)/' | paste -sd '|')
    xargs -0 grep -Pazo "$regex"
} |
....
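For illustration, here is what the two joins produce on a hypothetical regex_1.txt containing three made-up patterns:

$ printf 'foo.*bar\nbaz\nqux[0-9]+\n' > regex_1.txt
$ < regex_1.txt paste -sd '|'
foo.*bar|baz|qux[0-9]+
$ < regex_1.txt sed 's/.*/(&)/' | paste -sd '|'
(foo.*bar)|(baz)|(qux[0-9]+)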
Notes:
- When reading lines in the shell, use `IFS= read -r line`.
- The `-d ''` option to `read` is bash syntax.
- Use `grep -F` instead of the deprecated `fgrep`.
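A quick demonstration of why `IFS= read -r line` matters: a bare `read` strips leading whitespace and interprets backslashes, while the `IFS= read -r` form preserves the line exactly (sample input invented for the example):

$ printf '  \\tsome text\n' | { read line; echo "[$line]"; }
[tsome text]
$ printf '  \\tsome text\n' | { IFS= read -r line; echo "[$line]"; }
[  \tsome text]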