Remove row with certain column length?


I have a text file which looks like this:

A : 1
Boy : 3
Ahoy! : 7
more : 8

I have to remove rows whose length is 3 letters or less. The output should look like this:

Ahoy! : 7
more : 8

Thanks


Solution

  • The question is a little unspecific, and (due to comm(ent|un)ication) several possible solutions evolved, depending on how I interpreted it.

    My 1st script filter.awk:

    $3 <= 3 { next }
    { print $0 }
    

    This considers only the 3rd field (using the default white space separation). Thus, the number after the colon is compared with the constant 3.

    Your test input filter.txt:

    A : 1
    Boy : 3
    Ahoy! : 7
    more : 8
    

    Test:

    $ awk -f filter.awk filter.txt
    Ahoy! : 7
    more : 8
    
    $
    

    Ed Morton pointed out that it can be done even shorter:

    $3 > 3
    

    This was new to me as well: a rule that has only a pattern and no action prints every matching record. (Maybe I was confused by lex, which works the opposite way: in lex/flex, everything unmatched is echoed.)
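
    The equivalence of both forms can be checked quickly. A minimal sketch, assuming a POSIX shell; the helper function sample and the variable names long_form and short_form are made up for this illustration:

    ```shell
    # Sample input from the question.
    sample() { printf 'A : 1\nBoy : 3\nAhoy! : 7\nmore : 8\n'; }

    # Long form: skip lines whose 3rd field is <= 3, print everything else.
    long_form=$(sample | awk '$3 <= 3 { next } { print $0 }')

    # Short form: only a pattern, no action.
    short_form=$(sample | awk '$3 > 3')

    printf '%s\n' "$short_form"
    [ "$long_form" = "$short_form" ] && echo "both forms agree"
    ```

    For this input, both variables end up holding the same two lines.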

    A more robust approach would be to use the colon (:) as column separator (or, as it is called in awk, field separator). The field separator can be changed by assigning the built-in variable FS. This can be done using the command line option -F or by an assignment in a special BEGIN rule, which is executed once before any input is read. (I prefer the latter to make the scripts "self-contained".)

    Thus, filter2.awk (i.e. filter.awk V2.0):

    BEGIN { FS = ":" }
    $2 <= 3 { next }
    { print $0 }
    

    or considering what I've learnt today:

    BEGIN { FS = ":" }
    $2 > 3
    

    Test:

    $ awk -f filter2.awk filter.txt
    Ahoy! : 7
    more : 8
    
    $
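
    For comparison, here is a sketch of the same filter with the field separator passed via -F instead of a BEGIN rule. The helper function sample and the variable names via_begin and via_flag are only for illustration. Note that with ":" as separator the 2nd field carries a leading blank (e.g. " 7"); awk still compares it as a number because the field looks numeric.

    ```shell
    # Sample input from the question.
    sample() { printf 'A : 1\nBoy : 3\nAhoy! : 7\nmore : 8\n'; }

    # Variant 1: field separator set inside the script (as in filter2.awk).
    via_begin=$(sample | awk 'BEGIN { FS = ":" } $2 > 3')

    # Variant 2: field separator passed with -F on the command line.
    via_flag=$(sample | awk -F ':' '$2 > 3')

    printf '%s\n' "$via_flag"
    ```

    Both variants should select the same lines; which one to use is mostly a matter of taste.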
    

    Another interpretation of the OP might be to consider the number of consecutive letters in the first column of every line. To achieve this, some of the built-in functions come into play:

    1. gensub(), a powerful replacement function which is unfortunately only available in GNU awk

    2. length() which returns the length of a string (or the number of elements in an array)

    For this I use an extended test input filter2.txt:

    A : 1
    Boy : 3
    Ahoy! : 7
    more : 8
    Hello World : 0
    Hello! World. : 0
    Hi World : 0
    

    filter3.awk (i.e. filter.awk V3.0):

    length(gensub(/(^[A-Za-z]+).*$/, "\\1", 1, $1)) > 3
    

    Test:

    $ awk -f filter3.awk filter2.txt
    Ahoy! : 7
    more : 8
    Hello World : 0
    Hello! World. : 0
    
    $
    

    As the field separator is unchanged in this case, the 1st field consists of the characters up to the 1st white space. The pattern (^[A-Za-z]+) captures all letters at the beginning of the text and stores them in the 1st capture group. The .*$ matches the rest up to the end of the text. The whole match is then replaced by the contents of that group, \1. (Note the escaped backslash in "\\1".)

    This works fine in my bash on cygwin because I once defined LANG=C in my bash initialization (after having trouble with the German locale); in other locales, a range like [A-Za-z] may match a different set of characters. Ed Morton (again) pointed out that using [[:alpha:]] instead of [A-Za-z] should be more robust.
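
    To make the replacement step visible, the following sketch prints, for every line of filter2.txt, the 1st field, what gensub() returns for it, and the resulting length. It assumes that GNU awk is installed under the name gawk (an assumption about your system; the guard skips the demo otherwise) and uses [[:alpha:]] as suggested. The names sample, lead and trace are made up for this illustration.

    ```shell
    # Contents of filter2.txt.
    sample() { printf 'A : 1\nBoy : 3\nAhoy! : 7\nmore : 8\nHello World : 0\nHello! World. : 0\nHi World : 0\n'; }

    # gensub() is a GNU extension, so only run the demo when gawk is present.
    if command -v gawk >/dev/null 2>&1; then
      trace=$(sample | gawk '{
        # Keep only the leading letters of the 1st field.
        lead = gensub(/(^[[:alpha:]]+).*$/, "\\1", 1, $1)
        print $1 " -> " lead " (" length(lead) ")"
      }')
      printf '%s\n' "$trace"
    else
      echo "gawk not found; skipping the gensub() demo"
    fi
    ```

    Only the lines whose reported length is greater than 3 pass the filter in filter3.awk.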

    If you have a non-GNU awk, then gensub() is not available. (A few weeks ago, another guru (one with a k in his reputation) taught me that there are hardly any awks other than gawk out there in the world. Checking this, I realized that even the awk in our company's Windows VS build chain is actually a gawk. However, since I learnt this, I have stumbled multiple times over the fact that my answers were not well accepted because I didn't consider that the solution was explicitly (or implicitly) required for non-GNU awk...)

    So here is my 4th version for non-GNU awk filter4.awk:

    {
      text = $1
      gsub(/[^[:alpha:]].*$/, "", text)
      if (length(text) > 3) { print $0 }
    }
    

    Test:

    $ awk -f filter4.awk filter2.txt
    Ahoy! : 7
    more : 8
    Hello World : 0
    Hello! World. : 0
    
    $
    

    For gsub(), I inverted the logic of the regex replacement: everything from the first non-alpha character to the end of the text is replaced by the empty string. (AFAIK, gsub() does not even have something like the numbered buffers of gensub().)

    The assignment to the temporary variable text is necessary because gsub() modifies the contents of its 3rd argument. If I had passed $1 directly (as I did before fixing it), its contents would have been changed, which in turn would also have changed the contents of $0.
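
    This side effect is easy to reproduce. The sketch below runs the substitution once directly on $1 and once on a copy; the variable names direct and copied are made up for this illustration.

    ```shell
    # Variant 1: gsub() directly on $1; awk rebuilds $0 from the changed fields.
    direct=$(printf 'Ahoy! : 7\n' | awk '{ gsub(/[^[:alpha:]].*$/, "", $1); print $0 }')

    # Variant 2: gsub() on a copy; $1 and $0 stay untouched.
    copied=$(printf 'Ahoy! : 7\n' | awk '{ text = $1; gsub(/[^[:alpha:]].*$/, "", text); print $0 }')

    printf 'direct: %s\ncopied: %s\n' "$direct" "$copied"
    ```

    Only the first variant loses the exclamation mark in the printed line. Because assigning to a field makes awk rebuild $0 with the output field separator, the direct variant would also collapse any unusual spacing between the fields.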