How can you tell which characters are in which character classes?


People often ask why their string doesn't match their regexp. Sometimes the answer is that they expected a character to be part of a character class when it isn't, or that they used a shorthand for a character class (e.g. \d for [[:digit:]]) that exists in some other tool but isn't part of the awk language. With that in mind, I'm creating a canonical answer to the question of which characters are in which character classes in awk.


Solution

  • The following script prints the set of chars in each character class (plus the \s, \S, \w, and \W extensions, if your awk supports them) for your locale, covering the chars in the numeric range 0-127 as listed in the first table at http://www.asciitable.com/ and at https://en.wikipedia.org/wiki/ASCII. For the horizontal tab character, as output by print "\t", the first reference uses the abbreviation TAB and the second uses HT; I prefer TAB so I used it below. Both use Space for the char output by print " ", so I did the same below even though I more commonly call it a "blank char":

    $ cat prtCharClasses.awk
    # From the gawk manual, https://www.gnu.org/software/gawk/manual/gawk.html#Bracket-Expressions:
    #   [:alnum:]   Alphanumeric characters
    #   [:alpha:]   Alphabetic characters
    #   [:blank:]   Space and TAB characters
    #   [:cntrl:]   Control characters
    #   [:digit:]   Numeric characters
    #   [:graph:]   Characters that are both printable and visible (a space is printable but not visible, whereas an ‘a’ is both)
    #   [:lower:]   Lowercase alphabetic characters
    #   [:print:]   Printable characters (characters that are not control characters)
    #   [:punct:]   Punctuation characters (characters that are not letters, digits, control characters, or space characters)
    #   [:space:]   Space characters (these are: space, TAB, newline, carriage return, formfeed and vertical tab)
    #   [:upper:]   Uppercase alphabetic characters
    #   [:xdigit:]  Characters that are hexadecimal digits
    #   \s          Matches any whitespace character. Think of it as shorthand for ‘[[:space:]]’.
    #   \S          Matches any character that is not whitespace. Think of it as shorthand for ‘[^[:space:]]’.
    #   \w          Matches any word-constituent character—that is, it matches any letter, digit, or underscore. Think of it as shorthand for ‘[[:alnum:]_]’.
    #   \W          Matches any character that is not word-constituent. Think of it as shorthand for ‘[^[:alnum:]_]’.
    
    BEGIN {
        asciiMax = (asciiMax == "" ? 127 : asciiMax)
    
        numClasses = split("\
            [[:alpha:]]     \
            [[:digit:]]     \
            [[:alnum:]]     \
            [[:lower:]]     \
            [[:upper:]]     \
            [[:xdigit:]]    \
            [[:punct:]]     \
            [[:cntrl:]]     \
            [[:graph:]]     \
            [[:print:]]     \
            [[:blank:]]     \
            [[:space:]]     \
            \\s             \
            \\S             \
            \\w             \
            \\W             \
        ", classes)
    
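        # Map the control chars and white space in the 0-127 range to
        # their abbreviations to make them visible in the output.
        # split() fills map[] starting at index 1, as does chars[] below,
        # so the char with code N is at index N+1 (hence DEL, code 127, at 128):
        split("NUL SOH STX ETX EOT ENQ ACK BEL BS TAB LF VT FF CR SO SI DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US Space", map)
        map[128] = "DEL"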
    
        for (asciiNr=0; asciiNr<=asciiMax; asciiNr++) {
            char = sprintf("%c", asciiNr)
            chars[++numChars] = char
        }
    
        for (classNr in classes) {
            class = classes[classNr]
            for (charNr in chars) {
                char = chars[charNr]
                if ( char ~ class ) {
                    classChars[classNr,charNr]
                }
            }
        }
    
        for (classNr=1; classNr<=numClasses; classNr++) {
            class = classes[classNr]
            printf "%-12s =", class
            for (charNr=1; charNr<=numChars; charNr++) {
                if ( (classNr,charNr) in classChars ) {
                    char = chars[charNr]
                    printf " %s", (charNr in map ? map[charNr] : char)
                }
            }
            print ""
        }
    }
    

    Here is its output for chars 0-127 in the C locale. The output will differ in other locales, so run the above script to see what it is in yours:

    $ awk -f prtCharClasses.awk
    [[:alpha:]]  = A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b c d e f g h i j k l m n o p q r s t u v w x y z
    [[:digit:]]  = 0 1 2 3 4 5 6 7 8 9
    [[:alnum:]]  = 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b c d e f g h i j k l m n o p q r s t u v w x y z
    [[:lower:]]  = a b c d e f g h i j k l m n o p q r s t u v w x y z
    [[:upper:]]  = A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
    [[:xdigit:]] = 0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f
    [[:punct:]]  = ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
    [[:cntrl:]]  = NUL SOH STX ETX EOT ENQ ACK BEL BS TAB LF VT FF CR SO SI DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US DEL
    [[:graph:]]  = ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~
    [[:print:]]  = Space ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~
    [[:blank:]]  = TAB Space
    [[:space:]]  = TAB LF VT FF CR Space
    \s           = TAB LF VT FF CR Space
    \S           = NUL SOH STX ETX EOT ENQ ACK BEL BS SO SI DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~ DEL
    \w           = 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P Q R S T U V W X Y Z _ a b c d e f g h i j k l m n o p q r s t u v w x y z
    \W           = NUL SOH STX ETX EOT ENQ ACK BEL BS TAB LF VT FF CR SO SI DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US Space ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ ` { | } ~ DEL
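
If you only care about one character, you don't need the whole table. The following is a minimal sketch, assuming an awk that supports POSIX character classes; in_class is a hypothetical helper name, not part of awk or of the script above. It passes the character and the class through ARGV so that awk applies no escape processing to them, and it reports the result through its exit status:

```shell
# Hypothetical helper: succeed if the character in $1 matches the
# bracket expression in $2, e.g. in_class '_' '[[:punct:]]'.
in_class() {
    # The program has only a BEGIN block, so awk never tries to
    # open the two operands as input files.
    awk 'BEGIN { exit !(ARGV[1] ~ ARGV[2]) }' "$1" "$2"
}

in_class '_' '[[:punct:]]' && echo "_ is in [[:punct:]]"
in_class '_' '[[:alnum:]]' || echo "_ is not in [[:alnum:]]"
```

This agrees with the table above: the underscore is listed under [[:punct:]] and not under [[:alnum:]], which is why \w has to add it explicitly.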
    

    Note that \s, \S, \w, and \W are extensions available only in some tools, e.g. GNU awk. \d and \D do not appear above: they are extensions, available in some tools that support PCREs, that act as shorthand for [[:digit:]] and [^[:digit:]], and no variant of awk supports them. If you want a shorthand for [[:digit:]] then [0-9] appears to be portable across locales, but I stand to be corrected.
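
To illustrate the two portable ways of matching a digit mentioned in the note above, here is a small sketch. has_digit is a hypothetical helper name introduced only for this example, and the sketch assumes your awk supports POSIX character classes:

```shell
# Hypothetical helper: report whether $1 contains a digit, testing
# both the POSIX character class and the explicit range.
has_digit() {
    printf '%s\n' "$1" |
    awk '{
        posix = ($0 ~ /[[:digit:]]/)   # POSIX character class
        range = ($0 ~ /[0-9]/)         # explicit range
        printf "class:%s range:%s\n", (posix ? "yes" : "no"), (range ? "yes" : "no")
    }'
}

has_digit 'abc123'
has_digit 'abcdef'
```

On an awk with character class support, the class and range results should agree for both inputs. If they disagree, that is a hint that your awk doesn't understand [[:digit:]].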

    If you need to see the chars past number 127, then you can set asciiMax on the command line, e.g.:

    $ awk -v asciiMax=255 -f prtCharClasses.awk
    [[:alpha:]]  = A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b c d e f g h i j k l m n o p q r s t u v w x y z ª µ º À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ø ù ú û ü ý þ ÿ
    [[:digit:]]  = 0 1 2 3 4 5 6 7 8 9
    [[:alnum:]]  = 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b c d e f g h i j k l m n o p q r s t u v w x y z ª µ º À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ø ù ú û ü ý þ ÿ
    [[:lower:]]  = a b c d e f g h i j k l m n o p q r s t u v w x y z µ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ø ù ú û ü ý þ ÿ
    [[:upper:]]  = A B C D E F G H I J K L M N O P Q R S T U V W X Y Z À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö Ø Ù Ú Û Ü Ý Þ
    [[:xdigit:]] = 0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f
    [[:punct:]]  = ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~ ¡ ¢ £ ¤ ¥ ¦ § ¨ © « ¬ ® ¯ ° ± ² ³ ´ ¶ · ¸ ¹ » ¼ ½ ¾ ¿ × ÷
    [[:cntrl:]]  = NUL SOH STX ETX EOT ENQ ACK BEL BS TAB LF VT FF CR SO SI DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US DEL                                
    [[:graph:]]  = ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~ ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ® ¯ ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿ À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
    [[:print:]]  = Space ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~   ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ® ¯ ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿ À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
    [[:blank:]]  = TAB Space  
    [[:space:]]  = TAB LF VT FF CR Space  
    \s           = TAB LF VT FF CR Space  
    \S           = NUL SOH STX ETX EOT ENQ ACK BEL BS SO SI DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~ DEL                                 ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ­ ® ¯ ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿ À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
    \w           = 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P Q R S T U V W X Y Z _ a b c d e f g h i j k l m n o p q r s t u v w x y z ª µ º À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ø ù ú û ü ý þ ÿ
    \W           = NUL SOH STX ETX EOT ENQ ACK BEL BS TAB LF VT FF CR SO SI DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US Space ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ ` { | } ~ DEL                                   ¡ ¢ £ ¤ ¥ ¦ § ¨ © « ¬ ­ ® ¯ ° ± ² ³ ´ ¶ · ¸ ¹ » ¼ ½ ¾ ¿ × ÷
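
The difference between the two outputs above comes from the locale, and you can see it on a single character. The sketch below is only an illustration: classify_e_acute is a hypothetical helper name, and en_US.UTF-8 is an assumption about which locales are installed on your system, so substitute one listed by locale -a if needed:

```shell
# Hypothetical helper: feed "é" (the UTF-8 byte sequence \303\251) to
# awk under the LC_ALL value given in $1 and report whether
# [[:alpha:]] matches it.
classify_e_acute() {
    printf '\303\251\n' |
    LC_ALL="$1" awk '{ print ($0 ~ /[[:alpha:]]/ ? "alpha" : "not alpha") }'
}

classify_e_acute C            # in the C locale only ASCII letters are alphabetic
classify_e_acute en_US.UTF-8  # result depends on the locale being installed
```

In the C locale this should print "not alpha"; what the second call prints depends on your awk and on whether the named locale is available.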
    

    Note: if your awk doesn't allow escaping literal newlines inside a string, then the first numClasses = split(...), which does so, will produce an error message like the following (courtesy of @Fravadona running nawk on Solaris):

    /usr/bin/nawk: newline in string \... at source line 22
    

    If that happens to you then:

    1. Your awk version probably doesn't support character classes either so you're probably wasting your time running the above script.
    2. If you want to try it anyway, just remove the newlines and backslashes from that first split() so it becomes numClasses = split("[[:alpha:]] [[:digit:]] ... \\w \\W", classes).
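
Before editing the script, you may want to check whether your awk understands character classes at all. This is a rough sketch rather than a definitive test, and awk_class_support is a hypothetical helper name:

```shell
# Hypothetical helper: report whether the awk on your PATH appears to
# support POSIX character classes inside bracket expressions.
awk_class_support() {
    awk 'BEGIN {
        # With support, [[:alpha:]] matches the letter "a".
        # Without it, the regexp is read as the set "[:alph" followed
        # by a literal "]", which the single char "a" cannot match.
        print ("a" ~ /^[[:alpha:]]$/ ? "supported" : "not supported")
    }'
}

awk_class_support
```

If it prints "not supported" then point 1 above applies and the output of the full script won't be meaningful.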