Tags: java, cassandra, kubernetes, spring-data-cassandra

Issue while updating a record in a 3-node Cassandra cluster deployed using Kubernetes


I have a 3-node Cassandra cluster with a replication factor of 2 and the read/write consistency level set to QUORUM. We are using Spring Data Cassandra. All of the infrastructure is deployed using Kubernetes.
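For reference, the replica arithmetic behind this setup can be sketched as follows. This is a minimal illustration that assumes a single data center; the class and method names are hypothetical. Cassandra computes QUORUM as floor(RF / 2) + 1, so with RF = 2 both replicas must respond to every read and every write. Because read replicas plus write replicas exceed RF, each QUORUM read overlaps each acknowledged QUORUM write, so the consistency level by itself should not let a read miss an update (it also means no replica can be down).

```java
public class QuorumCheck {

    // QUORUM for a single data center: floor(RF / 2) + 1 replicas.
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // A read is guaranteed to contact at least one replica holding the latest
    // acknowledged write when readReplicas + writeReplicas > replicationFactor.
    static boolean readsSeeLatestWrite(int rf, int readReplicas, int writeReplicas) {
        return readReplicas + writeReplicas > rf;
    }

    public static void main(String[] args) {
        int rf = 2; // replication factor from the question
        int q = quorum(rf);
        System.out.println("QUORUM replicas for RF=" + rf + ": " + q);
        System.out.println("Read/write overlap guaranteed: " + readsSeeLatestWrite(rf, q, q));
        System.out.println("Replicas that may be down: " + (rf - q));
    }
}
```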

In the normal use case, many records get inserted into a Cassandra table. Then we try to modify/update one of the records using the repository's save method, like below:

ChunkMeta tmpRec = chunkMetaRepository.save(chunkMeta);

After executing the above statement we never see any exception or error, yet the update fails intermittently: when we check the record in the DB, sometimes it has been updated successfully and other times it has not. Also, when we print tmpRec from the above statement, it contains the updated, correct values. Still, these updated values are not reflected in the DB.

We checked the Cassandra transport TRACE logs on all nodes and found that our queries are logged there and are being executed.

Another weird observation is that all of this works if I use a single Cassandra node (in Kubernetes), or if we deploy the above infrastructure using Ansible (it even works with 3 nodes under Ansible).

It looks like the issue is specific to the 3-node Kubernetes deployment of Cassandra. Primarily, it looks like replication among the nodes is causing this.

Contents of the Dockerfile:

FROM ubuntu:16.04

RUN apt-get update && apt-get install -y python sudo lsof vim dnsutils net-tools && apt-get clean && \
    addgroup testuser && useradd -g testuser testuser && usermod --password testuser testuser;

RUN mkdir -p /opt/test && \
    mkdir -p /opt/test/data;

ADD jre8.tar.gz /opt/test/
ADD apache-cassandra-3.11.0-bin.tar.gz /opt/test/

RUN chmod 755 -R /opt/test/jre && \
    ln -s /opt/test/jre/bin/java /usr/bin/java && \
    mv /opt/test/apache-cassandra* /opt/test/cassandra;

RUN mkdir -p /opt/test/cassandra/logs;

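# ENV persists JAVA_HOME for later layers and the running container;
# a separate "RUN export JAVA_HOME" is a no-op because each RUN uses its own shell.
ENV JAVA_HOME /opt/test/jre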

COPY version.txt /opt/test/cassandra/version.txt

WORKDIR /opt/test/cassandra/bin/

RUN mkdir -p /opt/test/data/saved_caches && \
    mkdir -p /opt/test/data/commitlog && \
    mkdir -p /opt/test/data/hints && \
    chown -R testuser:testuser /opt/test/data && \
    chown -R testuser:testuser /opt/test;

USER testuser

CMD cp /etc/cassandra/cassandra.yml ../conf/conf.yml && perl -p -e 's/\$\{([^}]+)\}/defined $ENV{$1} ? $ENV{$1} : $&/eg; s/\$\{([^}]+)\}//eg' ../conf/conf.yml > ../conf/cassandra.yaml && rm ../conf/conf.yml && ./cassandra -f

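Please note that conf.yml is essentially the cassandra.yml file containing the Cassandra properties, with ${...} placeholders that are filled in from environment variables at container startup.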
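The CMD line renders cassandra.yaml by replacing every ${NAME} placeholder with the matching environment variable and blanking any placeholder that has no value. For readers less familiar with the perl one-liner, here is a rough Java equivalent of that net effect. It is only a sketch: the class name and the example variables CLUSTER_NAME, SEEDS and POD_IP are hypothetical, and unlike the second perl pass it does not strip placeholders that happen to appear inside substituted values.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlaceholderRenderer {

    // Matches ${NAME} and captures NAME (same pattern as the perl one-liner).
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    // Replaces each ${NAME} with env.get(NAME), or with "" when NAME is not defined.
    static String render(String template, Map<String, String> env) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = env.getOrDefault(m.group(1), "");
            // quoteReplacement keeps '$' or '\' in the value from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical template lines; the real file is the mounted cassandra.yml.
        String template = "cluster_name: ${CLUSTER_NAME}\n"
                + "seeds: \"${SEEDS}\"\n"
                + "listen_address: ${POD_IP}";
        System.out.println(render(template, System.getenv()));
    }
}
```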


Solution

  • Thanks, guys, and sorry for the delayed reply.

    I found the root cause of this behavior. Much later I found out that Cassandra relies on the client's timestamp for each column's write timestamp. "Client" here means the different pods (instances of the microservice). In my case there were 3 containers running on different hosts, and after a lot of struggle and research I figured out that there was a slight clock drift among these containers. Because Cassandra resolves conflicting writes to the same column by keeping the one with the highest timestamp (last write wins), an update sent from a pod whose clock lags behind the pod that wrote the previous value carries an older timestamp: the query succeeds, but reads keep returning the previous value. I then installed an NTP server on all these hosts, which kept the time in sync across the nodes. Instead of NTP, you can also use any other time-synchronization server/utility to avoid clock drift between nodes.

    This helped me and should also help others keep node clocks in sync. However, in certain corner cases, depending on the sync interval configured for the NTP server, I still found 2-3 seconds of drift across nodes (in my case the NTP sync interval was 2 seconds). This can be reduced further by shortening the sync interval.

    Ultimately, though, the root cause was just the clock drift across the nodes running the microservices.
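The failure mode described in the answer can be illustrated with a small, self-contained simulation. No Cassandra is involved; the class names, clock offsets and timing below are assumptions chosen for illustration. Each write carries the timestamp of the client that sent it, and reads return the version with the highest timestamp. If pod A's clock runs 1.5 s ahead and pod B updates the record 500 ms after A inserted it, B's update has the older timestamp and is shadowed. More generally, an update is shadowed whenever the real-time gap since the previous write is smaller than the clock offset between the two pods, which is why a residual drift of 2-3 seconds still matters for updates that closely follow an insert.

```java
import java.util.Arrays;
import java.util.List;

public class ClockSkewDemo {

    // One version of a cell: the written value plus its write timestamp (microseconds).
    static final class Cell {
        final String value;
        final long timestampMicros;

        Cell(String value, long timestampMicros) {
            this.value = value;
            this.timestampMicros = timestampMicros;
        }
    }

    // Last write wins: the version with the highest timestamp is what reads return.
    // (Cassandra breaks exact ties by comparing values; this sketch ignores ties.)
    static Cell reconcile(List<Cell> versions) {
        Cell winner = versions.get(0);
        for (Cell c : versions) {
            if (c.timestampMicros > winner.timestampMicros) {
                winner = c;
            }
        }
        return winner;
    }

    // Hypothetical pod clock: true time plus a fixed offset given in milliseconds.
    static long podClock(long trueTimeMicros, long offsetMillis) {
        return trueTimeMicros + offsetMillis * 1_000L;
    }

    // An update sent gapMillis after the insert (in real time) gets an older timestamp
    // whenever gapMillis < insertPodOffsetMs - updatePodOffsetMs.
    static boolean updateIsShadowed(long gapMillis, long insertPodOffsetMs, long updatePodOffsetMs) {
        return gapMillis < insertPodOffsetMs - updatePodOffsetMs;
    }

    public static void main(String[] args) {
        long t0 = 1_000_000_000_000L; // arbitrary "true" time in microseconds
        long podAOffsetMs = 1_500;    // assumed: pod A's clock is 1.5 s ahead
        long podBOffsetMs = 0;        // assumed: pod B's clock is accurate

        // Pod A inserts the record; pod B updates it 500 ms later in real time.
        Cell insert = new Cell("original", podClock(t0, podAOffsetMs));
        Cell update = new Cell("updated", podClock(t0 + 500_000L, podBOffsetMs));

        Cell visible = reconcile(Arrays.asList(insert, update));
        System.out.println("Value returned by reads: " + visible.value); // prints "original"
        System.out.println("Update shadowed: " + updateIsShadowed(500, podAOffsetMs, podBOffsetMs)); // prints "true"
    }
}
```

Keeping clocks in sync shrinks the offset term, which is why NTP fixes most cases but cannot protect an update that follows a write from a faster pod more closely than the remaining drift.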