Tags: perl, ascii, non-ascii-characters, non-printable

Perl script to count non-printable characters


I have hundreds of thousands of files that I would like to analyze. Specifically, I would like to calculate the percentage of printable characters in a sample of arbitrary size taken from each file. The files come from mainframes, Windows, Unix, etc., so they are likely to contain binary data and control characters.

I started by using the Linux "file" command, but it did not provide enough detail for my purposes. The following code conveys what I am trying to do, but does not always work.

    #!/usr/bin/perl -n

    use strict;
    use warnings;

    my $cnt_n_print = 0;
    my $cnt_print = 0;
    my $cnt_total = 0;
    my $prc_print = 0;

    #Count the number of non-printable characters
    while ($_ =~ m/[^[:print:]]/g) {$cnt_n_print++};

    #Count the number of printable characters
    while ($_ =~ m/[[:print:]]/g) {$cnt_print++};

    $cnt_total = $cnt_n_print + $cnt_print;
    $prc_print = $cnt_print/$cnt_total;

    #Print the # total number of bytes read followed by the % printable
    print "$cnt_total|$prc_print\n"

This is a test call that works:

    echo "test_string of characters" | /home/user/scripts/prl/s16_count_chars.pl

This is how I intend to call it, and works for one file:

    find /fct/inbound/trans/ -name "TRNST.20121115231358.xf2" -type f -print0 | xargs -0 head -c 2000 | /home/user/scripts/prl/s16_count_chars.pl

This does not work correctly:

    find /fct/inbound/trans/ -type f -print0 | xargs -0 head -c 2000 | /home/user/scripts/prl/s16_count_chars.pl

Neither does this:

    find /fct/inbound/trans/ -type f -print0 | xargs -0 head -c 2000 | perl -0 /home/user/scripts/prl/s16_count_chars.pl

Instead of running the script once for EACH file returned by find, it runs ONCE over the combined output for ALL of the files.
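
One way to get per-file results from a single Perl process is to pipe the file names, rather than the file contents, into the script. The following is a minimal sketch, assuming the names arrive NUL-separated from `find -print0` and that the script is told to read standard input with a `-` argument. The helper names `read_nul_separated_names` and `report_line`, and the `-` convention, are hypothetical and not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read NUL-separated file names from a filehandle,
# as produced by `find ... -print0`.
sub read_nul_separated_names {
    my ($fh) = @_;
    local $/ = "\0";      # use NUL, not newline, as the record separator
    my @names;
    while (defined(my $name = <$fh>)) {
        chomp $name;      # chomp strips the current $/ (the trailing NUL)
        push @names, $name if length $name;
    }
    return @names;
}

# Hypothetical helper: read up to $limit bytes from $path and return
# "name|bytes_read|fraction_printable", or undef if nothing could be read.
sub report_line {
    my ($path, $limit) = @_;
    open(my $fh, '<:raw', $path) or do {
        warn "Can't open $path: $!\n";
        return;
    };
    my $data = '';
    my $n_bytes = read($fh, $data, $limit);
    close $fh;
    return unless $n_bytes;   # skip unreadable or empty files

    # Count matches by assigning the match list to an empty list.
    my $n_print = () = $data =~ /[[:print:]]/g;
    return join '|', $path, $n_bytes, $n_print / $n_bytes;
}

# Assumed convention: a single "-" argument means "read names from STDIN".
if (@ARGV && $ARGV[0] eq '-') {
    for my $name (read_nul_separated_names(\*STDIN)) {
        my $line = report_line($name, 2000);
        print "$line\n" if defined $line;
    }
}
```

With the sketch saved under a hypothetical name such as count_from_stdin.pl, it could be called as `find /fct/inbound/trans/ -type f -print0 | perl count_from_stdin.pl -`. Each file is opened and sampled separately, so the output has one line per file.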

Thanks in advance.


Research so far:

Pipes, xargs, and separators

http://help.lockergnome.com/linux/help-understand-pipe-xargs--ftopict549399.html

http://en.wikipedia.org/wiki/Xargs#The_separator_problem


Clarification(s):
1.) Desired output: if there are 932 files in a directory, the output would be a 932-line list, each line giving the file name, the total bytes read from the file, and the % that were printable characters.
2.) Many of the files are binary. The script needs to handle embedded binary EOL or EOF sequences.
3.) Many of the files are large, so I would like to read only the first/last xx bytes. I had been trying to use head -c 256 or tail -c 128 to read the first 256 bytes or the last 128 bytes, respectively. The solution could either work in a pipeline or limit the bytes within the Perl script.
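
For clarification 3, the byte limits can be applied inside Perl instead of with head and tail. The following is a minimal sketch, assuming a hypothetical helper named `sample_head_tail` and using the 256-byte and 128-byte sizes mentioned above; it opens the file in raw mode, reads from the start, then seeks relative to the end.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl qw(SEEK_END);

# Hypothetical helper: return up to $head_len bytes from the start of the
# file and up to $tail_len bytes from its end, as two raw byte strings.
# For files shorter than $head_len + $tail_len the two samples overlap.
sub sample_head_tail {
    my ($path, $head_len, $tail_len) = @_;

    # ':raw' has the same effect as calling binmode on the handle.
    open(my $fh, '<:raw', $path) or die "Can't open $path: $!";

    my ($head, $tail) = ('', '');
    defined(read($fh, $head, $head_len)) or die "Can't read $path: $!";

    # Never seek back further than the file is long.
    my $size = -s $fh;
    my $want = $tail_len < $size ? $tail_len : $size;
    if ($want > 0) {
        seek($fh, -$want, SEEK_END) or die "Can't seek in $path: $!";
        defined(read($fh, $tail, $want)) or die "Can't read $path: $!";
    }

    close $fh;
    return ($head, $tail);
}

# Print the name and the size of each sample for every file given.
foreach my $path (@ARGV) {
    my ($head, $tail) = sample_head_tail($path, 256, 128);
    printf "%s|%d|%d\n", $path, length($head), length($tail);
}
```

Because the file is read as raw bytes with an explicit length, embedded newline or end-of-file marker bytes are returned like any other byte.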


Solution

  • Here is my working solution based on the feedback provided.

    I would appreciate any further feedback on form or more efficient methods:

        #!/usr/bin/perl
    
        use strict;
        use warnings;
    
        # This program receives one or more file paths on the command line.
        # For each file it attempts to read the first 2000 bytes.
        # The output is one line per file: the file name, the number of bytes
        # actually read and the fraction (0 to 1) of the bytes that are
        # ASCII "printable" aka [\x20-\x7E].
    
        die "Pass the file name(s) on the command line.\n" unless @ARGV;
    
        # loop through each file; calling shift on @ARGV inside a foreach
        # over @ARGV would skip every other file
        foreach my $file_name (@ARGV) {
    
           # open the file read only; 3-argument open also copes with
           # file names that start or end with special characters
           open(my $fh, '<', $file_name) or die "Can't open $file_name: $!";
    
           # read in binary mode to handle non-printable characters
           binmode $fh;
    
           # try to read 2000 bytes from the file, save the results in $data
           # and the actual number of bytes read in $n_bytes
           my $data = '';
           my $n_bytes = read $fh, $data, 2000;
           die "Can't read $file_name: $!" unless defined $n_bytes;
           close($fh);
    
           # count the number of non-printable characters
           my $cnt_n_print = 0;
           ++$cnt_n_print while ($data =~ m/[^[:print:]]/g);
    
           my $cnt_print = $n_bytes - $cnt_n_print;
    
           # an empty file has no bytes to classify; report 0 rather than
           # dividing by zero
           my $prc_print = $n_bytes ? $cnt_print/$n_bytes : 0;
    
           print "$file_name|$n_bytes|$prc_print\n";
        }
    

    Here is a sample of how to call the above script:

        find /some/path/to/files/ -type f -exec perl this_script.pl {} +
    

    Here's a list of references I found helpful:

    POSIX Bracket Expressions
    Opening files in binmode
    Read function
    Open file read only
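
On the request for more efficient methods: the counting loop in the script above can be replaced by a single `tr///` call, which counts matching characters without running the regex engine once per character. The following is a minimal sketch; the helper names `count_printable` and `printable_fraction` are hypothetical, and the range `\x20-\x7E` is the ASCII printable range named in the script's comments.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count bytes in the ASCII printable range \x20-\x7E.
sub count_printable {
    my ($data) = @_;
    # With an empty replacement list and no /d modifier, tr/// leaves the
    # string unchanged and returns the number of characters that matched.
    return ($data =~ tr/\x20-\x7E//);
}

# Hypothetical helper: fraction of printable bytes, or undef for empty
# input so that the caller decides how to report empty files.
sub printable_fraction {
    my ($data) = @_;
    my $n_bytes = length $data;
    return undef unless $n_bytes;
    return count_printable($data) / $n_bytes;
}
```

In the script above, `$cnt_print = count_printable($data);` would replace the regex loop, with `$cnt_n_print` then computed as `$n_bytes - $cnt_print`.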