While creating a Glue table using Terraform, I am getting the following error:
at 'table.storageDescriptor.columns.5.member.type' failed to satisfy constraint: Member must satisfy regular expression pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\t]*;
The column type in my Terraform configuration looks like this:
struct<
subscriber_id:string,
price_plan_id:string,
usage_rate_id:string,
entitlement_id:string,
charged_gl:string,
uncharged_gl:string,
audit_trail:array<
struct<
time:string,
outcome_details:array<
struct<balance_id:array<string>,
balance_delta:string,
event_type:string,
invoice_text:string>>>>>
I have kept the indentation for readability. When I change this to a minified version like the following, the error disappears.
struct<subscriber_id:string,price_plan_id:string,usage_rate_id:string,entitlement_id:string,charged_gl:string,uncharged_gl:string,audit_trail:array<struct<time:string,outcome_details:array<struct<balance_id:array<string>,balance_delta:double,event_type:string,invoice_text:string>>>>>"
The AWS documentation also lists this pattern for multiple APIs. The following is what it says about the Glue table's column type:
Type The datatype of data in the Column.
Type: String
Length Constraints: Minimum length of 0. Maximum length of 131072.
Pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\t]*
Required: No
I want to understand what this pattern means. I tried searching for the Unicode characters mentioned in the pattern but could not identify which characters they represent. I am familiar with regex, but I have never come across this kind of pattern.
That matches most Unicode characters, but it explicitly omits the basic ASCII control characters other than tab, and in particular newline (U+000A), which is why the indented version doesn't work.
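If the readable layout is worth keeping in source, one option is to strip the whitespace before the value is sent. The following is only a sketch in Python: `minify_type` is a hypothetical helper, the sample type is shortened from the question, and it assumes no field name contains whitespace.

```python
def minify_type(indented: str) -> str:
    """Remove all whitespace (spaces, tabs, newlines) from a type string.

    Assumption: field names never contain whitespace, so dropping it is safe.
    """
    return "".join(indented.split())


# Shortened version of the struct from the question, kept indented for reading.
indented = """struct<
  subscriber_id:string,
  audit_trail:array<
    struct<
      time:string>>>"""

print(minify_type(indented))
# struct<subscriber_id:string,audit_trail:array<struct<time:string>>>
```

The result contains no newline, so it stays inside the set of characters the pattern accepts.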
[\u0020-\uD7FF] means "any single Unicode character between U+0020 and U+D7FF". That includes all of the printable ASCII range (including space U+0020 and A U+0041), most characters in most languages, many mathematical symbols, and more. [\u0020-\uD7FF\uE000-\uFFFD] is "any character between U+0020 and U+D7FF or between U+E000 and U+FFFD", which covers nearly all of the Basic Multilingual Plane, that is, the characters allocated below U+10000. U+D800 through U+DFFF are not characters in their own right: they are reserved as UTF-16 surrogates, and U+FFFE and U+FFFF are noncharacters. The \uD800\uDC00-\uDBFF\uDFFF part is most likely not a range of lone surrogates but a surrogate-pair spelling of U+10000 through U+10FFFF, which is how a UTF-16 based regex engine such as Java's reads two consecutive surrogate escapes. The \t at the end is an ASCII tab U+0009.
This regexp does allow some characters like the non-breaking space U+00A0 and the non-ASCII control characters in the U+0080 through U+009F range. Under the surrogate-pair reading it also allows supplementary characters such as the face emoji starting at U+1F600.
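To see which character trips the constraint, the pattern can be tried out locally. This is a rough sketch, not how AWS evaluates it: `first_rejected` is a hypothetical helper, and the Python translation assumes the \uD800\uDC00-\uDBFF\uDFFF part stands for U+10000 through U+10FFFF, since Python's `re` does not accept surrogate pairs written that way.

```python
import re

# Python translation of the documented pattern.
# Assumption: the surrogate-pair range is rewritten as U+10000..U+10FFFF.
GLUE_TYPE_PATTERN = re.compile(
    r"[\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF\t]*"
)


def first_rejected(text):
    """Return (index, character) for the first character outside the pattern,
    or None if every character is accepted."""
    # The pattern ends in '*', so match() always succeeds and consumes the
    # longest prefix made only of accepted characters.
    end = GLUE_TYPE_PATTERN.match(text).end()
    return None if end == len(text) else (end, text[end])


print(first_rejected("struct<\n  a:string>"))  # (7, '\n')
print(first_rejected("struct<a:string>"))      # None
print(first_rejected("a\tb"))                  # None
```

The newline right after `struct<` is the first character rejected, while the tab and the single-line form pass.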
If you had a regexp matching ASCII letters like [A-Za-z]*, you could rewrite it in Unicode code point syntax as [\u0041-\u005A\u0061-\u007A]*.
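A quick way to convince yourself that the two spellings are the same pattern is to run both over a few strings. This is a small Python sketch; the sample strings are arbitrary.

```python
import re

# The same character class written two ways.
plain = re.compile(r"[A-Za-z]*")
escaped = re.compile(r"[\u0041-\u005A\u0061-\u007A]*")

samples = ["Glue", "struct", "a1", "", "\u00e9"]

for s in samples:
    # fullmatch() succeeds only if the whole string consists of allowed characters.
    same = bool(plain.fullmatch(s)) == bool(escaped.fullmatch(s))
    print(repr(s), bool(plain.fullmatch(s)), same)
```

Both patterns accept "Glue", "struct" and the empty string, and both reject "a1" and the accented letter, because the \uXXXX form only names the same code points by number.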