regex, terraform, terraform-provider-aws, aws-glue-data-catalog

What does this AWS pattern "[\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\t]*" mean?


While creating a Glue table using Terraform, I am getting the following error:

 at 'table.storageDescriptor.columns.5.member.type' failed to satisfy constraint: Member must satisfy regular expression pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\t]*;

The terraform column type looks like this:

struct<
subscriber_id:string,
price_plan_id:string,
usage_rate_id:string,
entitlement_id:string,
charged_gl:string,
uncharged_gl:string,
audit_trail:array<
    struct<
        time:string,
        outcome_details:array<
            struct<balance_id:array<string>,
                balance_delta:string,
                event_type:string,
                invoice_text:string>>>>>

I have kept the indentation for readability. When I change this to a minified version like the following, the error disappears.

struct<subscriber_id:string,price_plan_id:string,usage_rate_id:string,entitlement_id:string,charged_gl:string,uncharged_gl:string,audit_trail:array<struct<time:string,outcome_details:array<struct<balance_id:array<string>,balance_delta:double,event_type:string,invoice_text:string>>>>>

The AWS documentation also mentions this pattern for multiple APIs. The following is what it says about the Glue table's column type:

Type The datatype of data in the Column.

Type: String

Length Constraints: Minimum length of 0. Maximum length of 131072.

Pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\t]*

Required: No

I want to understand what this pattern means. I tried searching for the Unicode characters mentioned in the pattern but could not identify which characters are represented here. I am familiar with regex, but have never come across this kind of pattern.


Solution

  • That pattern matches most Unicode characters, but it deliberately omits the ASCII control characters other than tab, in particular newline (U+000A) and carriage return (U+000D), which is why the indented version doesn't work.

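    [\u0020-\uD7FF] means "any single Unicode character between U+0020 and U+D7FF". That includes all of the printable ASCII range (including space U+0020 and A U+0041), most characters in most languages, many mathematical symbols, and more. [\u0020-\uD7FF\uE000-\uFFFD] is "any character between U+0020 and U+D7FF or between U+E000 and U+FFFD", which covers nearly all of the Basic Multilingual Plane, that is, the characters defined before code points from U+10000 upward were allocated. The gap between the two ranges, U+D800 through U+DFFF, is the surrogate range: those code points are not characters in their own right, and U+FFFE and U+FFFF are noncharacters. The \uD800\uDC00-\uDBFF\uDFFF part is best read as a range between two UTF-16 surrogate pairs: \uD800\uDC00 encodes U+10000 and \uDBFF\uDFFF encodes U+10FFFF, so in an engine that treats a surrogate pair as a single code point (Java's regex engine does) it covers every supplementary character. The \t at the end is an ASCII tab U+0009.

    This regexp does allow some characters like the non-breaking space U+00A0 and other non-ASCII control characters in the U+0080 through U+009F range. Read with surrogate pairs, it also allows newer characters like the face emoji starting at U+1F600, although whether a given tool accepts them depends on how its regex engine handles surrogate escapes.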

    If you had a regexp matching ASCII letters like [A-Za-z]*, you could rewrite it in Unicode code point syntax as [\u0041-\u005A\u0061-\u007A]*.
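
To see which character trips the constraint, here is a minimal Python sketch of the same check. It rests on two assumptions. First, Python's re module does not understand surrogate pairs inside a character class, so the \uD800\uDC00-\uDBFF\uDFFF part is rewritten as \U00010000-\U0010FFFF, on the reading that it stands for U+10000 through U+10FFFF. Second, the helper names (first_rejected_char, minify_type) are hypothetical and not part of any AWS or Terraform API. The minify helper simply removes all whitespace, which is only safe if no field name in the type contains a space.

```python
import re

# Assumption: the surrogate-pair range in the AWS pattern stands for
# U+10000..U+10FFFF, written here with Python's \U escapes.
GLUE_TYPE_PATTERN = re.compile(
    r"[\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF\t]*"
)


def first_rejected_char(value):
    """Return (index, character) for the first character the pattern
    rejects, or None if the whole string is accepted."""
    # The pattern ends in '*', so match() always succeeds; the end of
    # the greedy match is where the first disallowed character sits.
    end = GLUE_TYPE_PATTERN.match(value).end()
    return None if end == len(value) else (end, value[end])


def minify_type(type_string):
    """Strip all whitespace from a readable, indented type string.

    Assumption: whitespace is never significant inside the type.
    """
    return re.sub(r"\s+", "", type_string)


readable = """struct<
subscriber_id:string,
audit_trail:array<
    struct<time:string>>>"""

print(first_rejected_char(readable))               # position of the newline
print(first_rejected_char(minify_type(readable)))  # None
```

Keeping the readable version in source and minifying it just before it is passed to the column type is one way to have both the indentation and a value that satisfies the pattern.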