We are using this code to create a Cassandra table:
df.createCassandraTable(
  keyspace,
  table,
  partitionKeyColumns = partitionKeyColumns,
  clusteringKeyColumns = clusteringKeyColumns)
where df is an org.apache.spark.sql.DataFrame. However, we find that the created table does not use the same data types as the DataFrame. Specifically, some columns in the DataFrame are of type short (aka smallint) and byte (aka tinyint), and they get promoted to int in the Cassandra table. We do not want this behavior; how can we fix it?
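For reference, here is a minimal, self-contained sketch of the setup above; the SparkSession, connection host, keyspace/table names, and sample data are illustrative placeholders, not taken from our actual job:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._
import com.datastax.spark.connector._ // provides the implicit createCassandraTable on DataFrame

// Illustrative session; spark.cassandra.connection.host must point at the cluster.
val spark = SparkSession.builder()
  .appName("create-cassandra-table-example")
  .config("spark.cassandra.connection.host", "127.0.0.1")
  .getOrCreate()

// A DataFrame with a ShortType (smallint) and a ByteType (tinyint) column.
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("small_col", ShortType),
  StructField("tiny_col", ByteType)))

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row(1, 2.toShort, 3.toByte))),
  schema)

// Create the Cassandra table from the DataFrame schema.
df.createCassandraTable(
  "my_keyspace",
  "my_table",
  partitionKeyColumns = Some(Seq("id")))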
EDIT: adding some notes to document our investigation. The call stack when createCassandraTable is called seems to hit this code, which promotes byte to int if the com.datastax.driver.core.ProtocolVersion is less than V4:
case ByteType => if (protocolVersion >= V4) TinyIntType else IntType
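The ShortType case behaves analogously. Paraphrasing the effect of that mapping for the small integral types (a rough sketch, not the connector's actual code):

import com.datastax.driver.core.ProtocolVersion
import org.apache.spark.sql.types.{ByteType, DataType, ShortType}

// Below protocol V4 there is no tinyint/smallint in CQL, so both ByteType and
// ShortType fall back to int; from V4 onward they map to the narrower CQL types.
def cqlTypeFor(sparkType: DataType, pv: ProtocolVersion): Option[String] = sparkType match {
  case ByteType  => Some(if (pv.compareTo(ProtocolVersion.V4) >= 0) "tinyint" else "int")
  case ShortType => Some(if (pv.compareTo(ProtocolVersion.V4) >= 0) "smallint" else "int")
  case _         => None // other types are handled elsewhere in the real mapping
}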
However, we have verified in our logs that we are indeed using V4 of the protocol:
17/05/24 17:43:42 INFO com.myApp$: com.datastax.driver.core.ProtocolVersion = V4
17/05/24 17:43:42 INFO com.myApp$: ProtocolVersion.NEWEST_SUPPORTED = V4
Our Cassandra cluster reports:
cqlsh> show version;
[cqlsh 5.0.1 | Cassandra 3.0.11 | CQL spec 3.4.0 | Native protocol v4]
and we use
<dependency>
  <groupId>com.datastax.spark</groupId>
  <artifactId>spark-cassandra-connector_2.11</artifactId>
  <version>2.0.0-M3</version>
</dependency>
in our dependencies.
The relevant connector source files along that call path are:
DataFrameFunctions.scala
Schema.scala
DataFrameColumnMapper.scala
ColumnType.scala
The 2.0.0-M3 version of spark-cassandra-connector_2.11 does not include these changes (the protocol-aware mapping shown above). The solution was to upgrade to 2.0.2, as below:
<dependency>
  <groupId>com.datastax.spark</groupId>
  <artifactId>spark-cassandra-connector_2.11</artifactId>
  <version>2.0.2</version>
</dependency>
and that fixed the problem.
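After the upgrade, the same createCassandraTable call should produce tinyint and smallint columns rather than int. One illustrative way to confirm from Spark (assuming the placeholder keyspace/table names from the sketch above) is to read the table back through the connector and inspect the inferred schema:

// Read the table back through the connector; the ByteType/ShortType columns are
// expected to round-trip instead of coming back widened to IntegerType.
val readBack = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))
  .load()

readBack.printSchema()
// expected (roughly):
// root
//  |-- id: integer (nullable = true)
//  |-- small_col: short (nullable = true)
//  |-- tiny_col: byte (nullable = true)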