Tags: bash, shell, awk, sed, grep

How do I delete all the lines that match and one after each of them?


I have a large file and a list of specific strings. The output should contain neither the lines that match one of those strings nor the line immediately following each match. Two consecutive matches are impossible because of the structure of the file I want to filter. For example:

Specific lines:

'ggg'
'sss'

Input:

'ggg'
'123'
'rrr'
'321'
'sss'
'666'

Output:

'rrr'
'321'

A simple grep -v -A 1 does not work.


Solution

    Assumptions:

    • we are looking for exact whole-line matches, including whitespace, punctuation marks and quotes
    • matches can occur on consecutive lines, in which case we ignore all of the matching lines plus the next non-matching line (NOTE: OP has since commented that consecutive matches are not possible; see the end of this answer for a simplified awk script)

    General approach:

    • if we find a matching line then we ignore the current line and set a flag to ignore the next line
    • if the flag is set we ignore the current line and clear the flag
    • otherwise we print the current line

    Sample input file:

    $ cat input
    'ggg'                       # match/ignore and 
    '123'                       # ignore
    'rrr'
    '321'
    'sss'                       # match/ignore and 
    '666'                       # ignore
    'aaa' 'ggg' 'xxx'
    '12345'
    'xxx'                       # match/ignore and
    'xxx'                       # match/ignore and
    98352                       # ignore
    'xyz'
    hello world
    

    Sample set of lines to match on (and ignore):

    $ cat lines
    'ggg'              # will not match on the line: 'aaa' 'ggg' 'xxx'
    'sss'
    rrr                # will not match on 'rrr' because of the missing quotes
    'xxx'              # will match on consecutive lines and skip the next non-matching line
    

    NOTE: the comments shown above are annotations only; they do not exist in the actual files
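To reproduce the results, the two sample files can be created without the annotations. This is a minimal sketch: the scratch directory from mktemp is an assumption made here so that no existing files named lines or input get overwritten; the file contents are the samples shown above.

```shell
# Work in a scratch directory so existing files are not overwritten (assumption).
cd "$(mktemp -d)" || exit 1

# Strings to match on, one per line, with no trailing comments.
cat > lines <<'EOF'
'ggg'
'sss'
rrr
'xxx'
EOF

# Sample input, again without the annotations.
cat > input <<'EOF'
'ggg'
'123'
'rrr'
'321'
'sss'
'666'
'aaa' 'ggg' 'xxx'
'12345'
'xxx'
'xxx'
98352
'xyz'
hello world
EOF
```

The quoted 'EOF' delimiter keeps the shell from expanding anything inside the here-documents, so the single quotes in the data are written literally.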

    One awk idea:

    awk '
    #### 1st file:
    
    FNR==NR { a[$0];  next }       # save line as index in array a[]
    
    #### 2nd file:
    
    $0 in a { skip=1; next }       # if line is an index in array then set the "skip" flag and ignore this line
    
    skip    { skip=0; next }       # if flag is set then clear flag and ignore this line
    
    1                              # otherwise print current line
    ' lines input
    
    ######
    # or as a one-liner
    
    awk 'FNR==NR {a[$0];next} $0 in a {skip=1;next} skip {skip=0;next} 1' lines input
    

    This generates:

    'rrr'
    '321'
    'aaa' 'ggg' 'xxx'
    '12345'
    'xyz'
    hello world
    

    NOTE: if these assumptions are wrong and/or this does not work for OP's actual files, then the question will need to be updated with a more representative set of data
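As one illustration of how the script might change if the first assumption is wrong, here is a sketch that treats each saved string as a literal substring rather than as a whole line. This is not something OP asked for; the variable names hit and s and the tiny sample files are introduced here purely for illustration.

```shell
# Scratch directory and small sample files (assumptions for this sketch).
cd "$(mktemp -d)" || exit 1
printf '%s\n' "'ggg'" > lines
printf '%s\n' "'aaa' 'ggg' 'xxx'" "'12345'" "'xyz'" > input

out=$(awk '
FNR==NR { a[$0]; next }                   # 1st file: save each string as an array index
{ hit=0
  for (s in a)                            # 2nd file: test every saved string ...
      if (index($0, s)) { hit=1; break }  # ... as a literal substring of the current line
}
hit  { skip=1; next }                     # substring found: ignore this line, flag the next one
skip { skip=0; next }                     # flag is set: clear it and ignore this line
1                                         # otherwise print current line
' lines input)

printf '%s\n' "$out"
```

With these sample files the first input line now counts as a match, so only 'xyz' is printed. Because every input line is compared against every saved string, this variant is likely to be noticeably slower than the exact-match lookup when the list of strings is long.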


    OP has added a comment stating consecutive line matches cannot occur. This allows us to simplify the code a bit:

    awk '
    FNR==NR { a[$0];   next }       # 1st file: save line as index in array a[]
    $0 in a { getline; next }       # 2nd file: if line is an index in array then read (and discard) the next line, then skip to the next input line
    1                               # otherwise print current line
    ' lines input
    
    ######
    # or as a one-liner
    
    awk 'FNR==NR {a[$0];next} $0 in a {getline;next} 1' lines input
    

    If we remove one of the consecutive 'xxx' lines from the input file, this will generate:

    'rrr'
    '321'
    'aaa' 'ggg' 'xxx'
    '12345'
    'xyz'
    hello world
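
The reason one 'xxx' line has to be removed for the simplified script can be seen by running both one-liners against an input that still has consecutive matches. The following is a small sketch; the scratch directory, the trimmed-down sample files and the variable names flag_out and getline_out are assumptions made for the demonstration.

```shell
# Scratch directory and a minimal input with two consecutive matches (assumptions).
cd "$(mktemp -d)" || exit 1
printf '%s\n' "'xxx'" > lines
printf '%s\n' "'xxx'" "'xxx'" 98352 "'xyz'" > input

# Flag-based script: every match re-arms the flag, so 98352 is skipped.
flag_out=$(awk 'FNR==NR {a[$0];next} $0 in a {skip=1;next} skip {skip=0;next} 1' lines input)

# getline-based script: the second match is consumed by getline without being
# tested against the array, so 98352 is printed.
getline_out=$(awk 'FNR==NR {a[$0];next} $0 in a {getline;next} 1' lines input)

printf 'flag version:\n%s\n\n' "$flag_out"
printf 'getline version:\n%s\n' "$getline_out"
```

The flag version prints only 'xyz', while the getline version prints 98352 followed by 'xyz'. If consecutive matches can never occur, as OP states, the two scripts should behave the same.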