Tags: cassandra, spark-cassandra-connector

How to prevent short and byte getting promoted to int when creating a cassandra table using the spark-cassandra-connector?


We are using this code to create a Cassandra table:

df.createCassandraTable(
        keyspace,
        table,
        partitionKeyColumns = partitionKeyColumns,
        clusteringKeyColumns = clusteringKeyColumns)

where df is an org.apache.spark.sql.DataFrame, but we find that the created table does not use the same data types as the DataFrame. Specifically, we have some columns in the DataFrame of type short (aka smallint) and byte (aka tinyint), and they get promoted to int in the Cassandra table. We do not want this behavior; how can we fix it?
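
For illustration, a minimal sketch of the kind of DataFrame we mean (the column names, values, and keyspace/table names below are hypothetical):

    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types._
    import com.datastax.spark.connector._   // adds createCassandraTable to DataFrame

    val spark = SparkSession.builder.getOrCreate()

    // hypothetical schema: "flags" (byte) and "count" (short) are the columns
    // whose types we would expect to survive as tinyint and smallint
    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = false),
      StructField("flags", ByteType, nullable = true),
      StructField("count", ShortType, nullable = true)))

    val df = spark.createDataFrame(
      spark.sparkContext.parallelize(Seq(Row(1, 1.toByte, 7.toShort))),
      schema)

    df.createCassandraTable(
      "my_keyspace", "my_table",
      partitionKeyColumns = Some(Seq("id")))
    // observed: "flags" and "count" are created as int, not tinyint / smallint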

EDIT: documenting our investigation. The call stack when createCassandraTable is invoked seems to hit this code, which would promote byte to int if the com.datastax.driver.core.ProtocolVersion is less than V4:

case ByteType => if (protocolVersion >= V4) TinyIntType else IntType
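
For what it's worth, the analogous case for short in the same file presumably carries the same protocol-version gate (quoted from our reading of the connector source, so treat it as approximate):

    case ShortType => if (protocolVersion >= V4) SmallIntType else IntType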

However, we have verified in our logs that we are indeed using V4 of the protocol.

17/05/24 17:43:42 INFO com.myApp$: com.datastax.driver.core.ProtocolVersion = V4 
17/05/24 17:43:42 INFO com.myApp$: ProtocolVersion.NEWEST_SUPPORTED = V4 
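
For reference, this is roughly how the negotiated version can be read back from the driver session (a sketch; it assumes sc is the active SparkContext):

    import com.datastax.spark.connector.cql.CassandraConnector

    // ask the Java driver which protocol version was actually negotiated
    val negotiated = CassandraConnector(sc.getConf).withSessionDo { session =>
      session.getCluster.getConfiguration.getProtocolOptions.getProtocolVersion
    }
    println(s"com.datastax.driver.core.ProtocolVersion = $negotiated")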

Our Cassandra cluster is:

cqlsh> show version;
[cqlsh 5.0.1 | Cassandra 3.0.11 | CQL spec 3.4.0 | Native protocol v4]

and we use

<dependency>
    <groupId>com.datastax.spark</groupId>
    <artifactId>spark-cassandra-connector_2.11</artifactId>
    <version>2.0.0-M3</version>
</dependency>

in our dependencies.

Relevant connector source files we traced through:

  • DataFrameFunctions.scala
  • Schema.scala
  • DataFrameColumnMapper.scala
  • ColumnType.scala


Solution

  • The 2.0.0-M3 version of spark-cassandra-connector_2.11 does not include these changes. The solution was to upgrade to 2.0.2, like below:

    <dependency>
         <groupId>com.datastax.spark</groupId>
         <artifactId>spark-cassandra-connector_2.11</artifactId>
         <version>2.0.2</version>
    </dependency>
    

    and that fixed the problem.
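
    For completeness, one way to verify the created column types programmatically after the upgrade is to read the table metadata back through the driver (a sketch; the keyspace and table names are placeholders and sc is the active SparkContext):

        import scala.collection.JavaConverters._
        import com.datastax.spark.connector.cql.CassandraConnector

        // print the CQL type of every column in the freshly created table
        CassandraConnector(sc.getConf).withSessionDo { session =>
          val tableMeta = session.getCluster.getMetadata
            .getKeyspace("my_keyspace")
            .getTable("my_table")
          tableMeta.getColumns.asScala.foreach { c =>
            println(s"${c.getName}: ${c.getType}")   // expect smallint / tinyint now
          }
        }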