python · postgresql · apache-spark · pyspark · aws-glue

Pyspark filter results before loading from Postgres (do not load entire table first)


I am trying to migrate a large amount of data from an RDS Postgres instance in a VPC to a Redshift cluster in the same VPC, using PySpark and AWS Glue. I only want to migrate the last 6 months of data; however, my job seems to load the entire table in question and then filter it, which is causing memory failures. Here is the code I have so far:

from pyspark.context import SparkContext
from awsglue.transforms import Filter
from awsglue.dynamicframe import DynamicFrame
from awsglue.context import GlueContext

sc = SparkContext()
sc.setLogLevel('WARN')
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Reads the whole table through the catalog connection, then filters in Spark
datasource0 = glueContext.create_dynamic_frame.from_catalog(database="db", table_name="table")
datasource0.printSchema()

filtered_dyF = Filter.apply(frame = datasource0, f = lambda x: x["scandate"] > "2020-05-31")
print(filtered_dyF.count())

Is there any way I can apply that filter on the load query instead? This path currently attempts to select * from table, and I would like it to instead run select * from table where scandate > '2020-05-31'.
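
One possible way to push the filter down to Postgres is to bypass the catalog reader and use Spark's plain JDBC source with a subquery as the dbtable, so the WHERE clause runs on the database side and only the filtered rows are transferred. A minimal sketch, assuming the Postgres JDBC driver is available to the job; the URL and credentials below are placeholders:

# Push the filter into the query sent to Postgres (sketch, placeholder connection details)
filtered_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://<host>:5432/db")   # placeholder host/database
    .option("dbtable", "(select * from table where scandate > '2020-05-31') as t")
    .option("user", "<db_user>")                          # placeholder credentials
    .option("password", "<db_password>")
    .load()
)

# Convert back to a DynamicFrame if later Glue transforms need one
filtered_dyF = DynamicFrame.fromDF(filtered_df, glueContext, "filtered_dyF")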


Solution

  • I ended up just using AWS Database Migration Service. Was actually pretty painless