Search code examples
amazon-web-servicesamazon-s3aws-glueamazon-athena

How can I correct AWS Glue Crawler/Data Catalog inferring all fields in CSV as strings when they're clearly not?


I have a big CSV text file uploaded weekly to an S3 path partitioned by upload date (maybe not important). The schema of these files are all the same, the formatting is all the same, the naming conventions are all the same. Each file contains ~100 columns and ~1M rows of mixed text/numeric types. The raw data looks like this:

id,date,string,int_values,double_values
"6F87U",2021-03-21,"Text",0,1.1483
"8DU87",2021-03-22,"More text, oh yes",1,2.525
"79LO2",2021-03-23,"Moar, give me moar, text",2,3.485489

When I run a Crawler with everything default, querying with Athena like so:

select * from tb_csv_data

...the results in Athena are thus:

id date string int_values double_values
"6F87U" 2021-03-21 "Text" 0 1.1483
"8DU87" 2021-03-22 "More text oh yes" 1
"79LO2" 2021-03-23 "Moar give me moar text

The problem at this level seems to be with proper detection (read: ignoring) of commas as delimiters within quotation marks. So I have a CSV classifier with the following characteristics that I have attached to the Crawler, I run the Crawler again with the classifier attached, and the resulting table properties are thus:

Input format    org.apache.hadoop.mapred.TextInputFormat
Output format   org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Serde serialization lib org.apache.hadoop.hive.serde2.OpenCSVSerde

Serde parameters    
quoteChar " 
separatorChar ,

Table properties    
sizeKey 4356512114 
objectCount 3 
UPDATED_BY_CRAWLER crawler-name 
CrawlerSchemaSerializerVersion 1.0 
recordCount 3145398 
averageRecordSize 1384 
CrawlerSchemaDeserializerVersion 1.0 
compressionType none 
columnsOrdered true 
areColumnsQuoted true
delimiter , 
typeOfData file

The resulting table with the same simple Athena query as above seems to be correct:

id date string int_values double_values
6F87U 2021-03-21 Text, yes 0 1.1483
8DU87 2021-03-22 More text, oh yes 1 2.525
79LO2 2021-03-23 Moar, give me moar, text 2 3.485489

The expected automatic inference of data types is supposed to be this (let's simplify and presume the date is correct as a string):

Column name Data type
id string
date string
string string
int_values bigint (or long)
double_values double

...but instead they're all strings!

Column name Data type
id string
date string
string string
int_values string
double_values string

I need this data to be accurately queryable from Athena as it is, where it is, so what can I do without further processing of the raw data? I suppose I could manually adjust the table properties in the Console but is that really correct when I need the entire pipeline to be automated? I also want to avoid having to cast types in queries 80+ times for each field as most of these columns are numeric. What can I do?

Thank you!


Solution

  • The limitation arrives from the serde that you are using in your query. Refer to note section in this doc which has below explanation :

    When you use Athena with OpenCSVSerDe, the SerDe converts all column types to STRING. Next, the parser in Athena parses the values from STRING into actual types based on what it finds. For example, it parses the values into BOOLEAN, BIGINT, INT, and DOUBLE data types when it can discern them. If the values are in TIMESTAMP in the UNIX format, Athena parses them as TIMESTAMP. If the values are in TIMESTAMP in Hive format, Athena parses them as INT. DATE type values are also parsed as INT.

    For date type to be detected it has to be in UNIX numeric format, such as 1562112000 according to the doc.