Search code examples
pythonpandasamazon-web-servicesaws-glueamazon-athena

AWS Athena select query result include data inconstantly


Environment
Collecting rawdata from python3.7 -> boto3 -> S3.
With parquet (serialized with fastparquet)
Using Glue Crawler with non-customize (just create, assign IAM role, s3 target, make empty data-catalog)



Problem
Following athena query returns only until 2020-04-01:

SELECT * FROM "rawdata" where "partition_0" >= '2020-03-29' and "partition_0" <= '2020-04-02';
SELECT * FROM "rawdata" where  ("partition_0" = '2020-03-29' or "partition_0" = '2020-03-30' or "partition_0" = '2020-03-31' or "partition_0" = '2020-04-01' or "partition_0" = '2020-04-02')
#These two queries are same meaning, same result.  

But, If I query to athena fowlling next same time, It returns 2020-04-02.

SELECT * FROM "rawdata" where "partition_0" >= '2020-04-02' and "partition_0" <= '2020-04-02';
SELECT * FROM "rawdata" where "partition_0" = '2020-04-02';
#Also these two queries are same meaning, same result.  



Structure
S3 partitions are folling next format :

bucketname/collectorname/merged/rawdata/yyyy-mm-dd/data.parquet

Glue Crawler has own Data-Catalog that's name is same with Glue Crawler.
And Glue Crawler's target is next

bucketname/collectorname/merged/rawdata

Every Glue Crawler's IAM Role is same, they have two AWS managed policies.

AWSGlueServiceRole
AmazonS3ReadOnlyAccess

The parquet file is saved with pandas.to_parquet with fastparquet, not compressed.
I didn't edit any script in Glue Crawler.



Workflow
A collector collects 3000 rows every 3 mins. (Mon to Fri, 9AM to 15:30PM)
So it merge onto 3000 columned parquet to save to next format

#always next data (because 3 min term, but it is saperated with seconds)
bucketname/collectorname/notmerged/rawdata/yyyy-mm-dd/hh_mm_ss.parquet.bz2

#always overwrited, if someone request latest snapshot, system just use it (not athena)
bucketname/collectorname/cached/latest/data.parquet

And next time, the parser works.
Parser is also made by python3.7 and It's pseudo code are next.

import pandas as pd
import io
import boto3MyCustomWrapped

#download last collected dataframe
dfnew = pd.read_parquet(io.BytesIO(unzip(boto3MyCustomWrapped.s3.get_object(bucket="bucketname",key="collectorname/cached/latest/data.parquet.bz2")))

#today yyyy-mm-dd
strftime_of_today_datetime_yyyy_mm_dd = datetime.datetime.utcnow().strftime('%Y-%m-%d')

#merged data
dfold = pd.read_parquet(io.BytesIO(boto3MyCustomWrapped.s3.get_object(bucket="bucketname",key=f"collectorname/merged/{strftime_of_today_datetime_yyyy_mm_dd}/data.parquet"))

#some process is skipped (like processing if dfold not exist)
dfmerged = pd.concat([dfold, dfnew])

#validate for athena optimize (like column type clean)
dfmerged = validate_and_process_and_sort_and_clean_index(dfmerged)

#upload overwrite onto dfold's s3 path (!!data catalog is connected only this path!!)
boto3MyCustomWrapped.s3.put_object_from_dataframe(bucket="bucketname",
    key=f"collectorname/merged/{strftime_of_today_datetime_yyyy_mm_dd}/data.parquet", dfmerged)

#today's glue crawling
if datetime.datetime.utcnow().hour == 9 and datetime.datetime.utcnow().minute < 10:
    boto3MyCustomWrapped.glue.try_start_crawler('collectorname')



Question
How can I working Athena returns include today with 'including today' query? (not only 'exact today' query)
The problem is Athena result include or not include 'today' data conditional, even every my query contains 'today'.
And I don't know why, and how to fix



More Information

There no error returned Athena all situations, and all returns have normal columns.


Solution

  • Long time watched, That problem never happen again. (I didn't do anything)
    In short finally I can't figured out why it happen.
    I feel little bit inscure about that, but I feeling responsibility of Gabip's help for my question.

    All thing I did it is

    1. Make new data-catalog that's name without '.' (for example, catalog name is "crawling.siteurl.itemtype" and table name is "rawdata", I made a new table with named "crawlingsiteurlitemtype"."rawdata")
    2. Same glue crawling from same source.
    3. show partitions to "crawlingsiteurlitemtype"."rawdata" working good (and it contains 2020-04-02)
    4. show partitions and msck repair dosen't work with my origin table. ("crawling.siteurl.itemtype"."rawdata")
    5. but after 3 and 4, suddenly result of athena query to my origin table contains '2020-04-02' data.
    6. I digged it today, but I can't find what makes it work, and Today, querying to the original data-catalog still returns good response of '2020-04-06(=today)'
    7. So I give up digging how to know, But feel little bit inscure because Someday, It may be break out again.

    Anyway, thank you for answer!