Tags: python, cassandra, pycassa

Cassandra buffered read of millions of columns


I've got a cassandra cluster with a small number of rows (< 100). Each row has about 2 million columns. I need to get a full row (all 2 million columns), but things start failing all over the place before I can finish my read. I'd like to do some kind of buffered read.

Ideally I'd like to do something like this using Pycassa (no, this isn't the proper way to call get, it's just so you can get the idea):

results = {}
start = 0
while True:
    # Fetch blocks of size 500
    buffer = column_family.get(key, column_offset=start, column_count=500)
    if len(buffer) == 0:
        break

    # Merge these results into the main one
    results.update(buffer)

    # Update the offset
    start += len(buffer)

Pycassa (and by extension Cassandra) doesn't let you do this. Instead you need to specify column names for column_start and column_finish. This is a problem since I don't actually know what the first or last column names will be. The special value "" can indicate the start or end of the row, but that doesn't work for any of the values in the middle.

So how can I accomplish a buffered read of all the columns in a single row? Thanks.


Solution

  • From the pycassa 1.0.8 documentation

    it would appear that you could use something like the following [pseudocode]:

    results = {}
    start_column = ""
    while True:
        # Fetch blocks of up to 500 columns, starting from the last column seen
        buffer = column_family.get(key, column_start=start_column,
                                   column_finish="", column_count=500)
        if not buffer:
            break
        results.update(buffer)

        # pycassa returns the columns as an ordered mapping, so the last
        # key is the last column name read; use it as the next start point
        last_column = list(buffer)[-1]
        if last_column == start_column:
            break  # only the already-seen column came back: end of row
        start_column = last_column

    Remember that on each subsequent call you only get 499 new results, because the first column returned is start_column, which you've already seen on the previous pass.
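The pagination pattern above can be exercised without a live Cassandra cluster. Below is a minimal sketch, assuming a stand-in class that mimics pycassa's slicing behaviour (returning up to `column_count` columns with names >= `column_start`, in sorted order); `FakeColumnFamily`, `read_full_row`, and all names in it are hypothetical, for illustration only:

```python
from collections import OrderedDict


class FakeColumnFamily:
    """Hypothetical stand-in for pycassa's ColumnFamily: get() returns up
    to column_count columns whose names are >= column_start, sorted."""

    def __init__(self, row):
        self.row = OrderedDict(sorted(row.items()))

    def get(self, key, column_start="", column_finish="", column_count=100):
        names = [n for n in self.row if n >= column_start]
        return OrderedDict((n, self.row[n]) for n in names[:column_count])


def read_full_row(cf, key, batch=500):
    """Buffered read of every column in one row, batch columns at a time."""
    results = OrderedDict()
    start_column = ""
    while True:
        buffer = cf.get(key, column_start=start_column, column_count=batch)
        if not buffer:
            break
        results.update(buffer)
        last_column = list(buffer)[-1]
        if last_column == start_column:
            break  # only the already-seen column came back: end of row
        start_column = last_column
    return results


# Usage: a 1,250-column row read in batches of 500 (three passes plus
# one final pass that returns only the already-seen last column)
row = {"col%07d" % i: i for i in range(1250)}
full = read_full_row(FakeColumnFamily(row), "row1", batch=500)
assert len(full) == 1250
```

If your pycassa version is recent enough, its documentation also describes a `ColumnFamily.xget()` generator that performs this kind of buffering internally and yields columns lazily; check the docs for the version you have installed.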