I've got a table set up like so:
{"String" : {uuid1 : "String", uuid1: "String"}, "String" : {uuid : "String"}}
Or...
Row_validation_class = UTF8Type
Default_validation_class = UTF8Type
Comparator = UUID
(It's basically got website as a row label, and has dynamically generated columns based on datetime.datetime.now() with TimeUUIDType in Cassandra and a string as the value)
I'm looking to use Pycassa to retrieve slices of the data based on both the row and the columns. However, on other (smaller) tables I've done this but by downloading the whole data set (or at least filtered to one row) and then had an ordered dictionary I could compare with datetime objects.
I'd like to be able to use something like the Pycassa multiget or get_indexed_slice function to pull certain columns and rows. Does something like this exist that allows filtering on datetime. All my current attempts result in the following error message:
TypeError: can't compare datetime.datetime to UUID
The best I've managed to come up with so far is...
def get_number_of_visitors(site, start_date, end_date=datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S:%f")):
pool = ConnectionPool('Logs', timeout = 2)
col_fam = ColumnFamily(pool, 'sessions')
result = col_fam.get(site)
number_of_views = [(k,v) for k,v in col_fam.get(site).items() if get_posixtime(k) > datetime.datetime.strptime(str(start_date), "%Y-%m-%d %H:%M:%S:%f") and get_posixtime(k) < datetime.datetime.strptime(str(end_date), "%Y-%m-%d %H:%M:%S:%f")]
total_unique_sessions = len(number_of_views)
return total_unique_sessions
With get_posixtime being defined as:
def get_posixtime(uuid1):
assert uuid1.version == 1, ValueError('only applies to type 1')
t = uuid1.time
t = (t - 0x01b21dd213814000L)
t = t / 1e7
return datetime.datetime.fromtimestamp(t)
This doesn't seem to work (isn't returning the data I'd expect) and also feels like it shouldn't be necessary. I'm creating the column timestamps using:
timestamp = datetime.datetime.now()
Does anybody have any ideas? It feels like this is the sort of thing that Pycassa (or another python library) would support but I can't figure out how to do it.
p.s. table schema as described by cqlsh:
CREATE COLUMNFAMILY sessions (
KEY text PRIMARY KEY
) WITH
comment='' AND
comparator='TimeUUIDType' AND
row_cache_provider='ConcurrentLinkedHashCacheProvider' AND
key_cache_size=200000.000000 AND
row_cache_size=0.000000 AND
read_repair_chance=1.000000 AND
gc_grace_seconds=864000 AND
default_validation=text AND
min_compaction_threshold=4 AND
max_compaction_threshold=32 AND
row_cache_save_period_in_seconds=0 AND
key_cache_save_period_in_seconds=14400 AND
replicate_on_write=True;
p.s.
I know you can specify a column range in Pycassa but I won't be able to guarantee that the start and end values of the range will have entries for each of the rows and hence the column may not exist.
You do want to request a "slice" of columns using the column_start
and column_finish
parameters to get()
, multiget()
, get_count()
, get_range()
, etc. For TimeUUIDType comparators, pycassa actually accepts datetime
instances or timestamps for those two parameters; it will internally convert them to a TimeUUID-like form with a matching timestamp component. There's a section of the documentation dedicated to working with TimeUUIDs that provides more details.
For example, I would implement your function like this:
def get_number_of_visitors(site, start_date, end_date=None):
"""
start_date and end_date should be datetime.datetime instances or
timestamps like those returned from time.time().
"""
if end_date is None:
end_date = datetime.datetime.now()
pool = ConnectionPool('Logs', timeout = 2)
col_fam = ColumnFamily(pool, 'sessions')
return col_fam.get_count(site, column_start=start_date, column_finish=end_date)
You could use the same form with col_fam.get()
or col_fam.xget()
to get the actual list of visitors.
P.S. try not to create a new ConnectionPool()
for every request. If you have to, set a lower pool size.