pyspark, aws-glue, spark-cassandra-connector, amazon-keyspaces

Connect Glue job to Amazon Keyspaces


I'm trying to connect an AWS Glue job to Amazon Keyspaces. Is there any way to connect to and work with those tables using PySpark? PS: I can't use the AWS CLI due to organization restrictions.


Solution

  • You can connect AWS Glue to Amazon Keyspaces by leveraging the open-source Spark Cassandra Connector.
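
    A minimal sketch of the connection settings this relies on, assuming the us-east-1 service endpoint and service-specific credentials (the endpoint, port, and credential values below are placeholders, not from the original answer; in the export example further down they would go on the SparkConf used to build the SparkContext):

        import org.apache.spark.SparkConf

        // Sketch only: point the Spark Cassandra Connector at the Amazon
        // Keyspaces endpoint for your region over TLS. The endpoint, region,
        // and credentials below are placeholders to replace with your own.
        val conf = new SparkConf()
          .set("spark.cassandra.connection.host", "cassandra.us-east-1.amazonaws.com")
          .set("spark.cassandra.connection.port", "9142")
          .set("spark.cassandra.connection.ssl.enabled", "true")
          .set("spark.cassandra.auth.username", "<service-specific-user>")
          .set("spark.cassandra.auth.password", "<service-specific-password>")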

    First, you will need to enable the Murmur3 partitioner or the random partitioner for your account. The statement can be run from any CQL client, for example the CQL editor in the Amazon Keyspaces console:

        UPDATE system.local set partitioner='org.apache.cassandra.dht.Murmur3Partitioner' where key='local';
    

    Second, make sure you understand the capacity required. By default, Keyspaces tables are created in on-demand mode, which learns the required capacity by doubling resources based on your previous traffic peak. Newly created tables can serve up to 4,000 WCUs and 12,000 RCUs per second. If you need higher capacity, create your table in provisioned mode with the desired throughput, then switch to on-demand mode; a sketch of both steps follows.
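
    As a sketch of both steps, assuming a SparkContext named spark as in the export example below, and hypothetical keyspace, table, and throughput values (the capacity_mode syntax is Amazon Keyspaces' custom_properties CQL extension):

        import com.datastax.spark.connector.cql.CassandraConnector

        // Hypothetical names and throughput values; replace with your own.
        CassandraConnector(spark.getConf).withSessionDo { session =>
          // Create the table with explicit provisioned throughput...
          session.execute(
            """CREATE TABLE IF NOT EXISTS my_keyspace.my_table (id text PRIMARY KEY)
              |WITH custom_properties = {
              |  'capacity_mode': {'throughput_mode': 'PROVISIONED',
              |                    'read_capacity_units': 12000,
              |                    'write_capacity_units': 4000}}""".stripMargin)
          // ...then switch it to on-demand mode.
          session.execute(
            """ALTER TABLE my_keyspace.my_table
              |WITH custom_properties = {
              |  'capacity_mode': {'throughput_mode': 'PAY_PER_REQUEST'}}""".stripMargin)
        }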

    Third, find our prebuilt examples in our samples repositories. We have patterns for export, import, count, and top-N. The examples show how to load the Spark Cassandra Connector from S3 and set up best practices for data loading. The following snippet shows an export to S3.

        import com.amazonaws.services.glue.GlueContext
        import com.amazonaws.services.glue.util.{GlueArgParser, Job}
        import com.datastax.spark.connector._
        import org.apache.spark.{SparkConf, SparkContext}
        import org.apache.spark.sql.{SaveMode, SparkSession}
        import org.apache.spark.sql.cassandra._
        import scala.collection.JavaConverters._

        object GlueApp {
          def main(sysArgs: Array[String]): Unit = {
            // Resolve the parameters passed to the Glue job
            val args = GlueArgParser.getResolvedOptions(sysArgs,
              Array("JOB_NAME", "TABLE_NAME", "KEYSPACE_NAME", "S3_URI", "FORMAT"))

            val spark: SparkContext = new SparkContext(new SparkConf())
            val glueContext: GlueContext = new GlueContext(spark)
            val sparkSession: SparkSession = glueContext.getSparkSession

            import sparkSession.implicits._

            Job.init(args("JOB_NAME"), glueContext, args.asJava)

            val tableName = args("TABLE_NAME")
            val keyspaceName = args("KEYSPACE_NAME")
            val backupS3 = args("S3_URI")
            val backupFormat = args("FORMAT")

            // Read the Keyspaces table through the Spark Cassandra Connector
            val tableDf = sparkSession.read
              .format("org.apache.spark.sql.cassandra")
              .options(Map("table" -> tableName, "keyspace" -> keyspaceName))
              .load()

            // Write the table to S3, failing if the target path already exists
            tableDf.write.format(backupFormat).mode(SaveMode.ErrorIfExists).save(backupS3)

            Job.commit()
          }
        }
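
    When creating the Glue job, you would pass these values as job parameters (--TABLE_NAME, --KEYSPACE_NAME, --S3_URI, and --FORMAT) and reference the connector jar in S3 through Glue's --extra-jars special parameter.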
    

    Best practice is to use rate limiting with Glue per DPU/worker. Understand the throughput you want to achieve per DPU and set the throttler in the Cassandra driver settings; one way to wire the settings into the job is sketched after the config block.

        advanced.throttler = {
          class = RateLimitingRequestThrottler
          max-requests-per-second = 1000
          max-queue-size = 50000
          drain-interval = 1 millisecond
        }
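
    To make the job pick these settings up, one option (an assumption here, based on the connector's documented profile-file support) is to ship the file alongside the job and point the connector at it:

        import org.apache.spark.SparkConf

        // Sketch: distribute application.conf with the job (for example via
        // Glue's --extra-files parameter) and load it as the driver profile.
        val conf = new SparkConf()
          .set("spark.cassandra.connection.config.profile.path", "application.conf")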
    

    You will want to ensure that you have proper IAM permissions (the cassandra:* actions) to access Amazon Keyspaces. If you're using a VPC endpoint, you will also want to grant the additional permissions the endpoint requires, such as ec2:DescribeNetworkInterfaces and ec2:DescribeVpcEndpoints.