AWK validate Field with Regex


I'm trying to define a proper regular expression that will validate a field.

The field is 26 characters long and may contain only: letters (lowercase or uppercase), whitespace ( ), commas (,), hyphens (-), or forward slashes (/).

The program should identify whether there is an improper character in field $3 via if ( $3 !~ /regex/ ). If there is, it should show the improper characters (in this case $ and *) via the showChars() function.

Current code:

awk '
   function showChars(fieldIn) {
      split(fieldIn,chars,"")
      for ( i=1; i<=length(chars); i++ ) {
         if (chars[i] !~ regex) {
            print "Invalid char found:" chars[i]
         }
      }
   }

   BEGIN {
      FS=""
      FIELDWIDTHS="4 4 26"
      regex="[a-zA-Z/, \t-]$"
   }

   {
      if ( $3 !~ /regex/ ) {
         print "Line " NR ": Problem in field"
         print "$3:"$3
         showChars($3)
         next
      } else {
         print "Line " NR ": OK"
         next
      }
   }
' $filename

This code enters the if branch for every line, but showChars() doesn't always report an invalid character, which makes me wonder why the if branch was taken in the first place.

Example input (each line ends after the 4+4+26-character fields):

Invalid field: !!!!----JOHN,DOE/-SMITH $*
Valid field: !!!!----ANA,DE/LACROIX

filename:

!!!!----JOHN,DOE/-SMITH $*        
!!!!----ANA,DE/LACROIX            

Solution

  • This may be what you're trying to do, using GNU awk for various extensions (FIELDWIDTHS here, gensub() further down):

    awk '
       function showchar(fieldIn,   chars,numChars,i) {
          numChars = split(fieldIn,chars,"")
          for ( i=1; i <= numChars; i++ ) {
             if ( chars[i] !~ chrRegex ) {
                print "Invalid char found:" chars[i]
             }
          }
       }
    
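       BEGIN {
          FIELDWIDTHS="4 4 26"                     # gawk: split each record into fixed-width fields
          chrRegex = "[[:alpha:][:space:],/-]"     # matches one allowed character
          fldRegex = "^(" chrRegex "){26}$"        # matches exactly 26 allowed characters
       }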
    
       {
          if ( $3 ~ fldRegex ) {
             print "Line " NR ": OK"
          }
          else {
             print "Line " NR ": Problem in field"
             print "$3:"$3
             showchar($3)
          }
       }
    ' "$filename"
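
    As for why your original script took the if branch on every line: $3 !~ /regex/ compares $3 against the literal text "regex", not against the contents of your variable named regex, so it is true for any field that doesn't contain those five letters. To use a variable as a dynamic regex you write $3 !~ regex, with no slashes. Separately, [a-zA-Z/, \t-]$ only requires the last character of the string to be in the set, which is fine for the single characters tested in showChars() but would not validate a whole field. Here is a minimal sketch of the literal-vs-variable difference; the function name, sample strings and pattern are made up for illustration:

    ```shell
    # Hypothetical demo: a regex literal /regex/ versus a dynamic regex held in a variable.
    demo_regex_literal() {
      printf '%s\n' 'JOHN,DOE' 'my regex here' |
      awk '
        BEGIN { regex = "^[A-Z,]+$" }   # illustrative pattern, not the one from the question
        {
          literal = ($0 ~ /regex/) ? "match" : "no match"   # looks for the text "regex"
          dynamic = ($0 ~ regex)   ? "match" : "no match"   # uses the contents of the variable
          print $0 ": /regex/ -> " literal ", regex -> " dynamic
        }
      '
    }
    demo_regex_literal
    ```

    That should print "JOHN,DOE: /regex/ -> no match, regex -> match" followed by "my regex here: /regex/ -> match, regex -> no match", i.e. the literal only matches the line that happens to contain the word regex.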
    

    Your showchar() function could just be this though:

    print "Invalid char(s) found:", gensub(chrRegex,"","g",$3)
    

    e.g.

    $ cat tst.sh
    #!/usr/bin/env bash
    
    filename="$1"
    
    awk '
       BEGIN {
          FIELDWIDTHS="4 4 26"
          chrRegex = "[[:alpha:][:space:],/-]"
          strRegex = "^" chrRegex "{26}$"
       }
    
       {
          if ( $3 !~ strRegex ) {
             print "Line " NR ": Problem in field"
             print "$3:"$3
             print "Invalid char(s) found:", gensub(chrRegex,"","g",$3)
          } else {
             print "Line " NR ": OK"
          }
       }
    ' "$filename"
    

    $ ./tst.sh file
    Line 1: Problem in field
    $3:JOHN,DOE/-SMITH $*
    Invalid char(s) found: $*
    Line 2: OK
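
    If you can't rely on GNU awk, the same check can probably be written without its extensions. The sketch below is one possible way, not a drop-in replacement: it assumes the 4+4+26 layout from the question, uses substr() in place of FIELDWIDTHS, gsub() on a copy in place of gensub(), and a length test instead of the {26} interval (which some older awk builds don't enable). validate_fields is a made-up name:

    ```shell
    # Hypothetical portable variant of the script above; validate_fields is a made-up name.
    validate_fields() {
      awk '
        BEGIN {
          chrRegex = "[[:alpha:][:space:],/-]"
          fldRegex = "^" chrRegex "+$"      # only allowed characters, at least one
        }
        {
          fld = substr($0, 9, 26)           # stands in for $3 under FIELDWIDTHS="4 4 26"
          if ( length(fld) == 26 && fld ~ fldRegex ) {
            print "Line " NR ": OK"
          }
          else {
            bad = fld                       # copy, so fld is left intact
            gsub(chrRegex, "", bad)         # strip every allowed character
            print "Line " NR ": Problem in field"
            print "$3:" fld
            print "Invalid char(s) found:", bad
          }
        }
      ' "$@"
    }

    # Sample records padded to 4+4+26 = 34 characters
    printf '%-34s\n' '!!!!----JOHN,DOE/-SMITH $*' '!!!!----ANA,DE/LACROIX' | validate_fields
    ```

    With those two sample records it should report the same four lines as the gawk run above. To check a file instead of stdin, call it as validate_fields "$filename".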
    

    By the way, don't write negative conditions if you can avoid them. You had:

    if ( !whatever ) {
        do_foo
    }
    else {
        do_bar
    }
    

    so ask yourself - under what condition do I call do_bar? It's "If it is NOT true that NOT whatever is true" - an inscrutable double negative. Just avoid using !s or other negative logic to keep your code clear and simple:

    if ( whatever ) {
        do_bar
    }
    else {
        do_foo
    }