
match pathname within double-zero-byte-separator input file


I am improving a script that lists duplicate files, which I wrote last year (see the second script if you follow the link).

The record separator of the duplicated.log output is the zero byte (\0) instead of the newline character \n. Example:

$> tr '\0' '\n' < duplicated.log
         12      dir1/index.htm
         12      dir2/index.htm
         12      dir3/index.htm
         12      dir4/index.htm
         12      dir5/index.htm

         32      dir6/video.m4v
         32      dir7/video.m4v

(In this example, the five files dir1/index.htm, ..., dir5/index.htm have the same md5sum and are each 12 bytes in size. The other two files, dir6/video.m4v and dir7/video.m4v, have the same md5sum and their content size (du) is 32 bytes.)

As each line ends with a zero byte (\0) instead of a newline (\n), blank lines are represented as two successive zero bytes (\0\0).
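
To experiment with this layout without the real log, a small sample can be generated with printf. This is only a sketch: the function name make_sample, the 9-column size field and the tab between size and path are assumptions, not taken from the original script.

```shell
# Emit a NUL-separated sample: every entry ends with \0, and a lone \0
# (an "empty line") separates two groups of duplicates.
# Assumption: size and path are separated by a tab.
make_sample() {
    # printf reuses the format for each (size, path) pair of arguments
    printf '%9s\t%s\0' 12 dir1/index.htm 12 dir2/index.htm 12 dir3/index.htm
    printf '\0'    # empty entry, giving the \0\0 group separator
    printf '%9s\t%s\0' 32 dir6/video.m4v 32 dir7/video.m4v
}

# Show the sample in readable form
make_sample | tr '\0' '\n'
```

Redirecting the function's output to a file (for example `make_sample > sample.log`) gives a test input with the same separators as duplicated.log.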

I use the zero byte as the line separator because a file path may itself contain a newline character.

But this choice leaves me with the following problem:
How to 'grep' all duplicates of a specified file from duplicated.log?
(e.g. How to retrieve duplicates of dir1/index.htm?)

I need:

$> ./youranswer.sh  "dir1/index.htm"  < duplicated.log | tr '\0' '\n'
         12      dir1/index.htm 
         12      dir2/index.htm 
         12      dir3/index.htm 
         12      dir4/index.htm 
         12      dir5/index.htm 
$> ./youranswer.sh  "dir4/index.htm"  < duplicated.log | tr '\0' '\n'
         12      dir1/index.htm 
         12      dir2/index.htm 
         12      dir3/index.htm 
         12      dir4/index.htm 
         12      dir5/index.htm 
$> ./youranswer.sh  "dir7/video.m4v"  < duplicated.log | tr '\0' '\n'
         32      dir6/video.m4v 
         32      dir7/video.m4v 

I was thinking about something like:

awk 'BEGIN { RS="\0\0" } #input record separator is double zero byte 
     /filepath/ { print $0 }' duplicated.log  

...but filepath may contain slashes (/) and many other symbols (quotes, newlines...).

I may have to use perl to deal with this situation...

I am open to any suggestions, questions, other ideas...


Solution

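  • You're almost there: pass the path in as an awk variable and use the matching operator ~ (note that the variable is still interpreted as a regular expression):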

    awk -v RS='\0\0' -v pattern="dir1/index.htm" '$0~pattern' duplicated.log
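
    Two caveats, as far as I can tell: the one-liner relies on an awk that accepts NUL bytes in RS (GNU awk does), and because ~ performs a regular-expression match, a `.` in the path matches any character. If a literal match is preferred, a Perl-based variant is one option. The sketch below is an assumption-laden illustration, not the original answer: the function name find_duplicates and the TARGET environment variable are made up for the example.

```shell
# Print every group of duplicates that mentions the given path.
# The path is matched as a literal substring, not as a regex.
# Assumption: groups are separated by two NUL bytes, as in the question.
find_duplicates() {
    # Pass the path through the environment so the shell never
    # re-interprets quotes, backslashes or newlines it may contain.
    TARGET="$1" perl -e '
        $/ = "\0\0";                  # one group of duplicates per read
        while (my $group = <STDIN>) {
            # index() is a plain substring search: no metacharacters
            print $group if index($group, $ENV{TARGET}) >= 0;
        }
    '
}

# Small demo on inline data: an empty argument yields the \0\0 separator
printf '%s\0' '12 dir1/index.htm' '12 dir2/index.htm' '' \
              '32 dir6/video.m4v' '32 dir7/video.m4v' |
    find_duplicates 'dir7/video.m4v' | tr '\0' '\n'
```

    With the real log this would be called as `find_duplicates "dir1/index.htm" < duplicated.log | tr '\0' '\n'`. Being a substring search, it also selects groups containing longer paths such as dir1/index.html; matching the whole path field exactly would need to know the precise separator between size and path.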