I am improving a script that lists duplicated files, which I wrote last year (see the second script if you follow the link). The record separator of the duplicated.log output is the zero byte instead of the newline \n. Example:
$> tr '\0' '\n' < duplicated.log
12 dir1/index.htm
12 dir2/index.htm
12 dir3/index.htm
12 dir4/index.htm
12 dir5/index.htm
32 dir6/video.m4v
32 dir7/video.m4v
(In this example, the five files dir1/index.htm, ..., dir5/index.htm have the same md5sum and their size is 12 bytes. The other two files, dir6/video.m4v and dir7/video.m4v, have the same md5sum and their disk usage (du) is 32 bytes.)
As each line ends with a zero byte (\0) instead of a newline (\n), blank lines are represented as two successive zero bytes (\0\0).
I use the zero byte as line separator because file paths may contain newline characters.
But doing so, I face this issue:
How to 'grep' all duplicates of a specified file from duplicated.log?
(e.g. How to retrieve the duplicates of dir1/index.htm?)
I need:
$> ./youranswer.sh "dir1/index.htm" < duplicated.log | tr '\0' '\n'
12 dir1/index.htm
12 dir2/index.htm
12 dir3/index.htm
12 dir4/index.htm
12 dir5/index.htm
$> ./youranswer.sh "dir4/index.htm" < duplicated.log | tr '\0' '\n'
12 dir1/index.htm
12 dir2/index.htm
12 dir3/index.htm
12 dir4/index.htm
12 dir5/index.htm
$> ./youranswer.sh "dir7/video.m4v" < duplicated.log | tr '\0' '\n'
32 dir6/video.m4v
32 dir7/video.m4v
I was thinking about something like:
awk 'BEGIN { RS="\0\0" } #input record separator is double zero byte
/filepath/ { print $0 }' duplicated.log
...but filepath may contain slashes (/) and many other symbols (quotes, newlines...).
I may have to use perl to deal with this situation...
I am open to any suggestions, questions, other ideas...
You're almost there: use the matching operator ~:
awk -v RS='\0\0' -v pattern="dir1/index.htm" '$0~pattern' duplicated.log