Search code examples
pythonamazon-web-servicesamazon-athenaamazon-sagemakerpyathena

Pyathena is super slow compared to querying from Athena


I run a query from AWS Athena console and takes 10s. The same query run from Sagemaker using PyAthena takes 155s. Is PyAthena slowing it down or is the data transfer from Athena to sagemaker so time consuming?

What could I do to speed this up?


Solution

  • Just figure out a way of boosting the queries:

    Before I was trying:

    import pandas as pd
    from pyathena import connect
    
    conn = connect(s3_staging_dir=STAGIN_DIR,
                 region_name=REGION)
    pd.read_sql(QUERY, conn)
    # takes 160s
    

    Figured out that using a PandasCursor instead of a connection is way faster

    import pandas as pd
    pyathena import connect
    from pyathena.pandas.cursor import PandasCursor
    
    cursor = connect(s3_staging_dir=STAGIN_DIR,
                     region_name=REGION,
                     cursor_class=PandasCursor).cursor()
    df = cursor.execute(QUERY).as_pandas()
    # takes 12s
    

    Ref: https://github.com/laughingman7743/PyAthena/issues/46