I have the below PySpark code to validate a field in nested JSON:
"CASE WHEN array_contains(transform(RECORDS_003.DATA.TOTAL_CHARGE, x -> trim(x)), '') OR exists(RECORDS_003.DATA.TOTAL_CHARGE, x -> x IS NULL) THEN 'TOTAL_CHARGE is missing' ELSE NULL END",
DATA is an array of JSON objects, each of which has a TOTAL_CHARGE field. The above validation works fine and gives the expected result. But now I want to check a regex (an amount regex) against each of the array elements and, if any value does not match the regex, flag the field as invalid.
I was trying the below code:
"CASE WHEN array_contains(RECORDS_003.DATA.TOTAL_CHARGE, regexp_extract(RECORDS_003.DATA.TOTAL_CHARGE, r'^[-+]?[0-9]*\.?[0-9]+$', 0)) THEN 'TOTAL_CHARGE is invalid' ELSE NULL END"
But it gives the following error:
AnalysisException: cannot resolve 'regexp_extract(df_view.RECORDS_003.`DATA`.`TOTAL_CHARGE`, '^[-+]?[0-9]*\\.?[0-9]+$', 0)' due to data type mismatch: argument 1 requires string type, however, 'df_view.RECORDS_003.`DATA`.`TOTAL_CHARGE`' is of array<string> type.; line 1 pos 56;
Can anyone please help me with how to check the regex on each array value?
The following validation worked for me. Note that LIKE only supports the SQL wildcards % and _, not regular expressions, so RLIKE is the operator needed here:
"CASE WHEN exists(RECORDS_003.DATA.TOTAL_CHARGE, x -> NOT (x RLIKE '^[-+]?[0-9]*\.?[0-9]+$')) THEN 'TOTAL_CHARGE is invalid' ELSE NULL END"
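As a side check outside of Spark, the amount regex itself can be exercised in plain Python; the sample values below are illustrative, not from the actual data:

```python
import re

# The amount regex used in the validation above: optional sign,
# optional integer part, optional decimal point, required digits.
AMOUNT_RE = re.compile(r'^[-+]?[0-9]*\.?[0-9]+$')

def is_valid_amount(value: str) -> bool:
    """Return True when the string is a well-formed amount."""
    return AMOUNT_RE.fullmatch(value) is not None

charges = ["12.50", "-3.99", "+100", "abc", "1.2.3", ""]
invalid = [c for c in charges if not is_valid_amount(c)]
# invalid == ["abc", "1.2.3", ""]
```

This mirrors what `exists(..., x -> NOT (x RLIKE ...))` does per element: any value left in `invalid` would cause the CASE expression to return the error message.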