cassandra cql

Mismatch in number of rows imported into cassandra table (COPY command)


I am trying to load a CSV file into a Cassandra table using the COPY command, but the number of rows in the CSV file does not match the number of rows in the table.

Number of rows in the CSV file: 49765 (excluding header)

Number of rows in the Cassandra table:

cqlsh:test_df> select Count(*) from test_table;

 count
-------
 46982

(1 rows)

Warnings :
Aggregation query used without partition key

COPY command:

COPY test_table (column1,column2,column3) from 'temp.csv'  with delimiter = ',' and header = True;

Error:

Starting copy of test_df.test_bhavcopy with columns [symbol, instrument, expiry_dt, strike_pr, option_typ, open, high, low, close, settle_pr, contracts, val_inlakh, open_int, ch_in_oi, price_date, key].
[ImportProcess-1, ImportProcess-2 and ImportProcess-3 each printed the same traceback, interleaved on the console; one copy follows]
Process ImportProcess-1:
Traceback (most recent call last):
  File "X:\Anaconda\lib\multiprocessing\process.py", line 267, in _bootstrap
    self.run()
  File "X:\apache-cassandra-3.11.3\bin\..\pylib\cqlshlib\copyutil.py", line 2328, in run
    self.close()
  File "X:\apache-cassandra-3.11.3\bin\..\pylib\cqlshlib\copyutil.py", line 2332, in close
    self._session.cluster.shutdown()
  File "X:\apache-cassandra-3.11.3\bin\..\lib\cassandra-driver-internal-only-3.11.0-bb96859b.zip\cassandra-driver-3.11.0-bb96859b\cassandra\cluster.py", line 1259, in shutdown
    self.control_connection.shutdown()
  File "X:\apache-cassandra-3.11.3\bin\..\lib\cassandra-driver-internal-only-3.11.0-bb96859b.zip\cassandra-driver-3.11.0-bb96859b\cassandra\cluster.py", line 2850, in shutdown
    self._connection.close()
  File "X:\apache-cassandra-3.11.3\bin\..\lib\cassandra-driver-internal-only-3.11.0-bb96859b.zip\cassandra-driver-3.11.0-bb96859b\cassandra\io\asyncorereactor.py", line 373, in close
    AsyncoreConnection.create_timer(0, partial(asyncore.dispatcher.close, self))
  File "X:\apache-cassandra-3.11.3\bin\..\lib\cassandra-driver-internal-only-3.11.0-bb96859b.zip\cassandra-driver-3.11.0-bb96859b\cassandra\io\asyncorereactor.py", line 335, in create_timer
    cls._loop.add_timer(timer)
AttributeError: 'NoneType' object has no attribute 'add_timer'
Processed: 49765 rows; Rate:    4193 rows/s; Avg. rate:    3906 rows/s
49765 rows imported from 1 files in 12.742 seconds (0 skipped).

Maybe the mismatch is due to this error.
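
The COPY output reports 49765 rows imported and 0 skipped, so it may be worth ruling out a cause unrelated to the traceback: Cassandra overwrites an existing row when a later write carries the same primary key, so repeated key values in the CSV would lower the table count. The sketch below counts data rows and distinct key values in a CSV file. The function name `count_rows_and_keys` is hypothetical, and treating the `key` column as the table's primary key is an assumption, since the table schema is not shown here.

```python
import csv
from collections import Counter


def count_rows_and_keys(csv_path, key_columns, delimiter=","):
    """Return (data_rows, distinct_keys, repeated) for a CSV file with a header row.

    key_columns is the list of header names assumed to form the primary key.
    repeated maps each key tuple that occurs more than once to its count.
    """
    counts = Counter()
    rows = 0
    with open(csv_path, newline="") as handle:
        for record in csv.DictReader(handle, delimiter=delimiter):
            rows += 1
            counts[tuple(record[column] for column in key_columns)] += 1
    repeated = {key: n for key, n in counts.items() if n > 1}
    return rows, len(counts), repeated
```

For example, `count_rows_and_keys("temp.csv", ["key"])` returns the row count, the number of distinct keys, and the repeated keys. If the number of distinct keys comes out at 46982 (2783 fewer than the 49765 data rows), the difference would be explained by overwrites rather than by the error.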


Solution

  • Found a fix: I edited asyncorereactor.py inside

    cassandra-driver-internal-only-3.11.0-bb96859b.zip/cassandra-driver-3.11.0-bb96859b/cassandra/io/asyncorereactor.py

    and changed the call AsyncoreConnection.create_timer() to self.create_timer(), as suggested in this post:

    https://datastax-oss.atlassian.net/browse/PYTHON-862?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel
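
Because the file lives inside a zip archive, the edit means rewriting the archive. The following is a minimal sketch of one way to do that, not an official procedure. The old line is copied from line 373 of the traceback, the new line applies the rename described above, and the names `patch_driver_zip`, `OLD`, `NEW` and `MEMBER_SUFFIX` are hypothetical. It keeps a `.bak` copy of the original archive next to it.

```python
import shutil
import zipfile
from pathlib import Path

# Old call as it appears in the traceback (asyncorereactor.py, line 373).
OLD = "AsyncoreConnection.create_timer(0, partial(asyncore.dispatcher.close, self))"
# Replacement call, per the fix described above.
NEW = "self.create_timer(0, partial(asyncore.dispatcher.close, self))"
# Only archive members whose path ends with this are modified.
MEMBER_SUFFIX = "cassandra/io/asyncorereactor.py"


def patch_driver_zip(zip_path):
    """Rewrite the archive with OLD replaced by NEW in asyncorereactor.py.

    A copy of the original archive is saved as <zip_path>.bak first.
    Returns the number of occurrences replaced.
    """
    zip_path = Path(zip_path)
    backup = zip_path.with_name(zip_path.name + ".bak")
    shutil.copy2(zip_path, backup)

    replaced = 0
    # zipfile cannot edit a member in place, so copy every member from the
    # backup into a fresh archive, changing only the target file.
    with zipfile.ZipFile(backup) as src, \
            zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename.endswith(MEMBER_SUFFIX):
                text = data.decode("utf-8")
                replaced += text.count(OLD)
                data = text.replace(OLD, NEW).encode("utf-8")
            dst.writestr(info, data)
    return replaced
```

Calling `patch_driver_zip` with the path of cassandra-driver-internal-only-3.11.0-bb96859b.zip under the Cassandra lib directory should return 1 if the line matches the traceback. A return value of 0 means the text was not found and the archive content is unchanged.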