apache-spark, cassandra, spark-cassandra-connector, amazon-keyspaces

Problem writing to Amazon Keyspaces with Spark 3.x


I'm trying to write to Amazon Keyspaces, but the job fails with an error message (screenshot not reproduced here).

Spark version: 3.0.1
Connector: 3.0
Java: 1.8
Scala: 2.12

These versions are consistent with the compatibility matrix on the connector's GitHub page (screenshot not reproduced here).

With previous versions (Connector 2.5.2 and Spark 2.4.6) the write works fine.


Solution

  • You should be able to connect using Spark 3 and connector 3. Here are some steps to validate that your connection is set up correctly and that you have the right permissions.

    • Make sure you have permissions to read the system tables.
    • If you have set up a VPC endpoint, ensure you have permission to describe VPC endpoints.
    • In your configuration, make sure that hostname validation is set to false in the SSL config.

    You should be able to execute the following query against the system.peers table and retrieve the IPs of the public or private endpoint. If you see one or zero peers, you need to take the steps above. Remember that the AWS console is not in your VPC, so it contacts the public endpoint, similar to S3.

    SELECT * FROM system.peers
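
    The same check can be run from Spark itself. Below is a minimal Scala sketch, assuming the external application.conf described further down in this answer; the app name is just a placeholder.

    import com.datastax.spark.connector.cql.CassandraConnector
    import org.apache.spark.sql.SparkSession

    // Placeholder app name; the driver profile is the application.conf shown later in this answer.
    val spark = SparkSession.builder()
      .appName("keyspaces-peers-check")
      .config("spark.cassandra.connection.config.profile.path", "application.conf")
      .getOrCreate()

    // One or zero rows from system.peers usually points to missing permissions
    // on the system tables or on the VPC endpoint, as described above.
    CassandraConnector(spark.sparkContext.getConf).withSessionDo { session =>
      val peers = session.execute("SELECT * FROM system.peers").all()
      println(s"system.peers returned ${peers.size()} row(s)")
    }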
    

    Sample policy. You need to grant access to the resource /keyspace/system* as well as ec2:DescribeNetworkInterfaces and ec2:DescribeVpcEndpoints on your VPC.

    {
       "Version":"2012-10-17",
       "Statement":[
          {
             "Effect":"Allow",
             "Action":[
                "cassandra:Select",
                "cassandra:Modify"
             ],
             "Resource":[
                "arn:aws:cassandra:us-east-1:111122223333:/keyspace/mykeyspace/table/mytable",
                "arn:aws:cassandra:us-east-1:111122223333:/keyspace/system*"
             ]
          },
          {
             "Sid":"ListVPCEndpoints",
             "Effect":"Allow",
             "Action":[
                "ec2:DescribeNetworkInterfaces",
                "ec2:DescribeVpcEndpoints"
             ],
             "Resource":"*"
          }
       ]
    }
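
    If your job authenticates with service-specific credentials generated for an IAM user, a policy like the one above can be attached with the AWS CLI; the user name, policy name, and file name below are placeholders.

    aws iam put-user-policy \
      --user-name my-keyspaces-user \
      --policy-name keyspaces-spark-access \
      --policy-document file://keyspaces-policy.json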
    

    Set up the connection by referencing the external config file:

    --conf spark.cassandra.connection.config.profile.path=application.conf
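
    For example, a spark-submit invocation could look like the sketch below; the connector coordinates match Spark 3.0 / Scala 2.12, while the class and jar names are placeholders.

    spark-submit \
      --packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0 \
      --files application.conf \
      --conf spark.cassandra.connection.config.profile.path=application.conf \
      --class com.example.KeyspacesJob \
      my-keyspaces-job.jar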
    

    Sample driver config.

    datastax-java-driver {
      # Amazon Keyspaces only supports LOCAL_QUORUM for writes
      basic.request.consistency = "LOCAL_QUORUM"
      # Regional service endpoint; TLS is required on port 9142
      basic.contact-points = [ "cassandra.us-east-1.amazonaws.com:9142" ]

      # Keep retrying the initial connection instead of failing fast
      advanced.reconnect-on-init = true

      basic.load-balancing-policy {
        local-datacenter = "us-east-1"
      }

      advanced.auth-provider = {
        class = PlainTextAuthProvider
        username = "user-at-sample"
        password = "S@MPLE=PASSWORD="
      }

      # Limit the number of concurrent requests sent by the driver
      advanced.throttler = {
        class = ConcurrencyLimitingRequestThrottler
        max-concurrent-requests = 30
        max-queue-size = 2000
      }

      # Disable hostname validation, as noted in the checklist above
      advanced.ssl-engine-factory {
        class = DefaultSslEngineFactory
        hostname-validation = false
      }

      # Connections per endpoint
      advanced.connection.pool.local.size = 1
    }
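
    With that profile in place, a write from Spark could look roughly like the following sketch; the keyspace and table reuse the placeholder names from the sample policy above, and the columns are placeholders that must match your own schema.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("keyspaces-write")
      .config("spark.cassandra.connection.config.profile.path", "application.conf")
      .getOrCreate()

    import spark.implicits._

    // Placeholder rows; the columns must match the target table definition.
    val df = Seq((1, "alice"), (2, "bob")).toDF("id", "name")

    df.write
      .format("org.apache.spark.sql.cassandra")
      .option("keyspace", "mykeyspace")
      .option("table", "mytable")
      .mode("append")
      .save()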