
DSBulk cannot connect to cluster to load CSV data


I am trying to load CSV files into a Cassandra cluster using the dsbulk utility. I have a local copy of the CSV file and am trying to connect to a remote cluster and load the CSV into a table. However, dsbulk fails to recognise the remote cluster address and reports:

Could not reach any contact point, make sure you've provided valid addresses

and

Caused by: An existing connection was forcibly closed by the remote host.

I am using the same connection parameters from IntelliJ to connect to the cluster with SSL enabled, and that works fine, so I can't figure out why it does not work with dsbulk. Below are the application.conf for dsbulk and the commands that I am trying to run:

dsbulk {
  --dsbulk.connector.name = csv
  --dsbulk.connector.csv.url = <CSV_Path>
  --dsbulk.connector.csv.header true
  --datastax-java-driver.basic.contact-points = [ "169.XX.XXX.XX", "169.XX.XXX.XX", "169.XX.XXX.XX" ]
  --datastax-java-driver.advanced.auth-provider.username = <user_name>
  --datastax-java-driver.advanced.auth-provider.password = <pwd>
  --dsbulk.schema.keyspace = <keyspace>
  --dsbulk.schema.table = <table>
  --datastax-java-driver.advanced.ssl-engine-factory.truststore-path = <cacerts path>
  --datastax-java-driver.advanced.ssl-engine-factory.truststore-password = <pwd>
  --datastax-java-driver.advanced.resolve-contact-points = true
}

commands :

$ dsbulk load -url <CSV_Path>

The above command does not pick up the application.conf properties and tries to connect to 127.0.0.1.

Error :

[driver] Error connecting to Node(endPoint=/127.0.0.1:9042, hostId=null, hashCode=2c61adb4)

I am not sure why dsbulk is not using the conf file.

$ dsbulk load -url <CSV_Path> -k keyspace -t table -h "[ "169.XX.XXX.XX", "169.XX.XXX.XX", "169.XX.XXX.XX" ]" -u userName -p pwd

The above command fails to connect to the cluster nodes added explicitly. Error :

[driver] Error connecting to Node(endPoint=/169.XX.XXX.XX:9042, hostId=null, hashCode=2a38b2fe),
Suppressed: [driver|control|id: 0x17d0139b, L:/172.31.50.184:59702 - R:/169.XX.XXX.XXX:9042] Protocol initialization request, step 1 (OPTIONS): unexpected failure (com.datastax.oss.driver.api.core.connection.ClosedConnectionException: Unexpected error on channel).
     Caused by: Unexpected error on channel.
       Caused by: An existing connection was forcibly closed by the remote host.

dsbulk retries on all nodes and gets the same error.

Authentication falls back to the plain-text provider, which I believe will work for my use case:

Username and password provided but auth provider not specified, inferring PlainTextAuthProvider

Could you please suggest what the problem is with my config or my connection to the remote cluster?

My actual use case is to archive millions of records from Sybase to Cassandra every week, for which I am trying to create a simple Java utility that executes dsbulk. Any other approach is also appreciated.

Many thanks in advance.


Solution

  • The problem is that you have not formatted the entries in the configuration file correctly so DSBulk cannot parse them. Since the configuration file is not usable, DSBulk defaults to connecting to localhost (127.0.0.1).

    The correct format looks like this:

    dsbulk {
       connector.name = csv
       schema.keyspace = "keyspacename"
       schema.table = "tablename"
    }
    

    Then define the Java driver options separately, like this:

    datastax-java-driver {
      basic {
        contact-points = [ "cp1", "cp2", "cp3"]
      }
      advanced {
        ssl-engine-factory {
          keystore-password = "keystorepass"
          keystore-path = "/path/to/keystore.file"
          class = DefaultSslEngineFactory
          truststore-password = "truststorepass"
          truststore-path = "/path/to/truststore.file"
        }
      }
    }
    

    If you don't configure SSL correctly, the driver will not be able to connect to any of the nodes, which is the reason for the errors you mentioned.

    Note that you can place the Java driver configuration in a separate driver.conf file, but you need to reference it in the application configuration with the line:

    include classpath("/path/to/driver.conf")
    

    For details, see Using SSL with DSBulk. Cheers!
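
For the Java utility mentioned in the question, one option is to launch DSBulk as a child process and pass the configuration file explicitly with -f, so the run does not depend on where DSBulk looks for application.conf by default. The following is a minimal sketch, not a definitive implementation: the class name DsbulkLauncher and its methods are hypothetical, and the DSBulk executable path, the configuration file (written in the format shown above), and the CSV path are all assumed to be supplied as arguments.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical helper: the class and method names are illustrative only
// and are not part of DSBulk or the DataStax Java driver.
class DsbulkLauncher {

    // Builds the argument list for "dsbulk load", pointing DSBulk at an
    // explicit configuration file (-f) and at the CSV source (-url).
    static List<String> buildCommand(String dsbulkExecutable, String configFile, String csvPath) {
        return List.of(dsbulkExecutable, "load", "-f", configFile, "-url", csvPath);
    }

    // Starts DSBulk as a child process, forwards its output to this
    // process's console, waits for it to finish, and returns its exit code.
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .inheritIO()
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("usage: DsbulkLauncher <dsbulk-executable> <application.conf> <csv-path>");
            System.exit(2);
        }
        // Propagate DSBulk's exit code so a scheduler can detect failed loads.
        System.exit(run(buildCommand(args[0], args[1], args[2])));
    }
}
```

Passing the arguments as a list avoids shell quoting problems, and returning the exit code lets a weekly scheduled job treat any nonzero value as a failed load. On Windows the first argument would typically need to be the full path to the DSBulk launcher script.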