
How to check if one file is part of another?


I need to check, from a bash script, whether one file is contained in another: given a multiline pattern and an input file, does the pattern occur in the file?

Return value:

I want to receive an exit status like grep's: 0 if any match was found, 1 if no match was found.

Pattern:

  • multiline,
  • order of lines is important (treated as a single block of lines),
  • includes characters such as numbers, letters, ?, &, *, # etc.,

Explanation

Only the following examples should produce a match:

pattern     file1 file2 file3 file4
222         111   111   222   222
333         222   222   333   333
            333   333         444
            444

The following shouldn't:

pattern     file1 file2 file3 file4 file5 file6 file7
222         111   111   333   *222  111   111   222
333         *222  222   222   *333  222   222   
            333   333*        444   111         333
            444                     333   333 

Here's my script:

#!/bin/bash

function writeToFile {
    if [ -w "$1" ] ; then
        echo "$2" >> "$1"
    else
        echo -e "$2" | sudo tee -a "$1" > /dev/null
    fi
}

function writeOnceToFile {
        pcregrep --color -M "$2" "$1"
        #echo $?

        if [ $? -eq 0 ]; then
            echo This file contains text that was added previously
        else
            writeToFile "$1" "$2"
        fi
}

file=file.txt 
#1?1
#2?2
#3?3
#4?4

pattern=`cat pattern.txt`
#2?2
#3?3

writeOnceToFile "$file" "$pattern"

I can run grep for each line of the pattern separately, but that fails with this example:

file.txt 
#1?1
#2?2
#=== added line
#3?3
#4?4

pattern.txt
#2?2
#3?3

or even if you swap lines 2 and 3:

file=file.txt 
#1?1
#3?3
#2?2
#4?4

Both cases return 0 when they shouldn't.

How can I fix it? Note that I prefer to use programs that are installed by default (ideally without pcregrep). Maybe sed or awk can solve this problem?


Solution

  • I have a working version using perl.

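    I thought I had it working with GNU awk, but I didn't. Setting RS to the empty string puts awk in paragraph mode, where records are split on blank lines, so it doesn't read the whole file as a single record. See the edit history for the broken awk version.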

    "How can I search for a multiline pattern in a file?" shows how to use pcregrep, but I can't see a way to get it to work when the pattern to search for may contain regex special characters. -F fixed-string mode doesn't usefully work with multi-line mode: it still treats the pattern as a set of lines to be matched separately, not as one multi-line fixed string. I see you were already using pcregrep in your attempt.

    BTW, I think you have a bug in your code in the non-sudo case:

    function writeToFile {
        if [ -w "$1" ] ; then
            "$2" >> "$1"   # probably you mean  echo "$2" >> "$1"
        else
            echo -e "$2" | sudo tee -a "$1" > /dev/null
        fi
    }
    

    Anyway, attempts at using line-based tools have met with failure, so it's time to pull out a more serious programming language that doesn't force the newline convention on us. Just read both files into variables, and use a non-regex search:

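    #!/usr/bin/perl -w
    # multi_line_match.pl  pattern_file  target_file
    # exit(0) if a match is found, else exit(1)
    
    use strict;
    use File::Slurp;    # not a core module; provides read_file()
    my $pat = read_file($ARGV[0]);
    my $target = read_file($ARGV[1]);
    
    # Only accept a match that starts on a line boundary: either at the very
    # start of the target, or right after a newline. $pat should end with a
    # newline so that the match also ends on a line boundary.
    if ((substr($target, 0, length($pat)) eq $pat) or index($target, "\n".$pat) >= 0) {
        exit(0);
    }
    exit(1);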
    

    See "What is the best way to slurp a file into a string in Perl?" to avoid the dependency on File::Slurp (which isn't part of the standard perl distribution, or of a default Ubuntu 15.04 system). I went for File::Slurp partly so that non-perl-geeks can read what the program is doing, compared to:

    my $contents = do { local(@ARGV, $/) = $file; <> };
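
Since the question asks for natively installed tools, the same "read both files into variables and do a non-regex search" idea can also be sketched in plain bash. This is only a sketch under some assumptions: `contains_block` is a hypothetical helper name, neither file contains NUL bytes, and trailing blank lines of the pattern are ignored because `$(< file)` strips trailing newlines. An empty pattern is treated as "no match", which is a choice of this sketch, not something the question specifies.

```shell
#!/bin/bash
# contains_block PATTERN_FILE TARGET_FILE   (hypothetical helper name)
# Returns 0 if the lines of PATTERN_FILE occur in TARGET_FILE as one
# contiguous block of whole lines, 1 if not, 2 if a file cannot be read.
contains_block() {
    local pat target
    pat=$(< "$1") || return 2         # $(< file) drops trailing newlines
    target=$(< "$2") || return 2
    [[ -n $pat ]] || return 1         # assumption: empty pattern never matches

    # Wrap both strings in newlines so the block has to start and end on a
    # line boundary. The quoted parts of the right-hand side are matched
    # literally, so characters like ?, &, * and # need no escaping.
    [[ $'\n'"$target"$'\n' == *$'\n'"$pat"$'\n'* ]]
}
```

In the script from the question it could be called as `if contains_block pattern.txt "$file"; then ...; else writeToFile "$file" "$pattern"; fi`.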
    

    I was working on avoiding reading the full file into memory, with an idea from http://www.perlmonks.org/?node_id=98208. I think non-matching cases would usually still read the whole file at once. Also, the logic was pretty complex for handling a match at the front of the file, and I didn't want to spend a long time testing to make sure it was correct for all cases. Here's what I had before giving up:

    #IO::File->input_record_separator($pat);
    $/ = $pat;  # pat must include a trailing newline if you want it to match one
    
    my $fh = IO::File->new($ARGV[1], O_RDONLY)   # target file is the second argument
        or die 'Could not open file ', $ARGV[1], ": $!";
    
    $tail = substr($fh->getline, -1);  #fast forward to the first match
    #print each occurrence in the file
    #print IO::File->input_record_separator  while $fh->getline;
    
    #FIXME: something clever here to handle the case where $pat matches at the beginning of the file.
    do {
        # fixme: need to check defined($fh->getline)
        if (($tail eq "\n") or ($tail = substr($fh->getline, -1))) {
            exit(0);  # if there's a 2nd line
        }
    } while($tail);
    
    exit(1);
    $fh->close;
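
For comparison, here is a different way to avoid holding the whole target file in memory. It is not a completion of the Perl attempt above: it swaps in a sliding window over the target, keeping only as many lines as the pattern has. It is a sketch that assumes a POSIX awk is available; `contains_block_stream` is a hypothetical name, and an empty pattern file is treated as "no match".

```shell
#!/bin/bash
# contains_block_stream PATTERN_FILE TARGET_FILE   (hypothetical helper name)
# Returns 0 if the pattern lines occur as a contiguous block in the target.
contains_block_stream() {
    [ -s "$1" ] || return 1                 # assumption: empty pattern never matches
    awk '
        NR == FNR { pat[++n] = $0; next }   # first file: remember the pattern lines
        {
            buf[FNR % n] = $0               # ring buffer of the last n target lines
            if (FNR < n) next               # not enough lines seen yet
            for (i = 1; i <= n; i++)
                # Appending "" forces a string comparison, so that numeric-looking
                # lines such as 1 and 1.0 are not treated as equal.
                if ((buf[(FNR - n + i) % n] "") != (pat[i] "")) next
            found = 1
            exit
        }
        END { exit !found }
    ' "$1" "$2"
}
```

The comparison is on whole lines, so there is no special case for a match at the start of the file, and awk can stop reading at the first match.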
    

    Another idea was to filter patterns and files to be searched through tr '\n' '\r' or something, so they would all be single-lines. (\r being a likely safe choice that wouldn't collide with anything already in a file or a pattern.)
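
A rough sketch of that tr idea, using grep -F on the flattened text. It assumes that neither file already contains \r (so it would misbehave on CRLF files) and that your grep copes with one very long line; `contains_block_tr` is a hypothetical name. As before, trailing blank lines in the pattern are dropped and an empty pattern counts as "no match".

```shell
#!/bin/bash
# contains_block_tr PATTERN_FILE TARGET_FILE   (hypothetical helper name)
# Returns 0 if the pattern lines occur as a contiguous block in the target.
contains_block_tr() {
    local pat needle
    pat=$(cat -- "$1") || return 2    # command substitution drops trailing newlines
    [[ -n $pat ]] || return 1         # assumption: empty pattern never matches

    # Turn every newline into \r. The leading and trailing \r stand in for
    # the line boundaries around the block.
    needle=$(printf '\n%s\n' "$pat" | tr '\n' '\r')

    # Flatten the target the same way and search it as one fixed string.
    { printf '\n'; cat -- "$2"; printf '\n'; } | tr '\n' '\r' | grep -qF -- "$needle"
}
```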