I'm new to Bigtable and I've been testing out the filtering features based on this documentation: https://cloud.google.com/bigtable/docs/using-filters
I've tried this in this repo under testcsvbigtablefilters.py, and I'm having some problems. For the record, I am testing this on the Bigtable emulator on my local machine: https://github.com/limjix/GoogleCloudBigDataTest
I have some issues with 2 filters:
rows = table.read_rows(filter_=row_filters.CellsColumnLimitFilter(10))
for row in rows:
    print_row(row)

rows = table.read_rows(
    filter_=row_filters.ValueRangeFilter(start_value=b'0', end_value=b'3'))
for row in rows:
    print_row(row)
Also, I want to ask: how do I query for cells that have been overwritten? I know Bigtable keeps multiple versions of a cell as it is mutated over time. How do I query and filter for a specific time for that cell?
Any help would be appreciated, there aren't many tutorials or much documentation anywhere else, so I hope the community can help.
Thank you!
Hey limjix, thanks for the questions! The Bigtable filtering documentation has just been introduced, so questions like this can help us improve it moving forward.
CellsColumnLimitFilter
The CellsColumnLimitFilter limits the number of cells in each column that are included in the output row. In the documentation it is listed as the cells per column filter, so I can see how this function name would be a bit confusing.
If you only have a row with a few columns that each have one cell or one version of their values, then CellsColumnLimitFilter would return all of them. If you're looking to receive only one column's data, you can use the CellsRowLimitFilter, which filters on cells per row, or you could target specific columns with any of the column qualifier filters (see the sketch below).
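As a minimal sketch of the difference (assuming a table handle already exists; the b'column1' qualifier is a placeholder):

from google.cloud.bigtable import row_filters

# At most one cell (the latest version) per column, for every column in the row
rows = table.read_rows(filter_=row_filters.CellsColumnLimitFilter(1))

# At most one cell per row, regardless of which column it comes from
rows = table.read_rows(filter_=row_filters.CellsRowLimitFilter(1))

# Only cells whose column qualifier matches the regex
rows = table.read_rows(
    filter_=row_filters.ColumnQualifierRegexFilter(b'column1'))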
ValueRangeFilter
I did some digging and I believe I know what your issue is here, but I'm not 100% sure, and I'm happy to troubleshoot further with you if you need.
It looks like you set the cell values directly from the CSV:
bigtablerow.set_cell(column_family_id,
                     "column1",
                     # str(float(csvrow[20]) + i),
                     csvrow[20],
                     timestamp=datetime.datetime.utcnow())
This works fine for strings, but if you set a cell value to an integer, the Python client will treat it as an incrementable value and encode it as a 64-bit big-endian signed integer, which will not be comparable with byte strings like the b'10000' that you have.
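You can see the mismatch directly in Python; here's a quick sketch using struct to mimic that encoding:

import struct

# How the client stores the integer 10000: 64-bit big-endian signed
struct.pack('>q', 10000)  # b'\x00\x00\x00\x00\x00\x00\x27\x10'

# What the filter compares against: five ASCII digit bytes
b'10000'                  # b'\x31\x30\x30\x30\x30'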
For example, this code won't return any rows:
import random

from google.cloud import bigtable
from google.cloud.bigtable import row_filters

client = bigtable.Client(project=project_id, admin=True)
instance = client.instance(instance_id)
table = instance.table(table_id)

rows = []
for i in range(10):
    row_key = 'test_num{}'.format(i).encode()
    row = table.direct_row(row_key)
    row.set_cell("cf".encode(),
                 "col".encode(),
                 random.randint(10000, 14000))
    rows.append(row)
table.mutate_rows(rows)

rows = table.read_rows(
    filter_=row_filters.ValueRangeFilter(start_value=b'10000',
                                         end_value=b'14000'))
for row in rows:
    print(row)
But when I turn the integer into a string, I get all the rows:
    row.set_cell("cf".encode(),
                 "col".encode(),
                 str(random.randint(10000, 14000)))
    rows.append(row)
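If you'd rather keep storing integers, one alternative sketch (my suggestion, not something from the docs) is to pack the filter bounds with the same 64-bit big-endian encoding so the byte-wise comparison lines up (this works for non-negative values):

import struct

# Bounds encoded the same way the client encodes int cell values
rows = table.read_rows(
    filter_=row_filters.ValueRangeFilter(
        start_value=struct.pack('>q', 10000),
        end_value=struct.pack('>q', 14000)))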
I would recommend using the cbt tool to check out what the data looks like. For example, when I do cbt read, the first rows look like this:
test_num0
  cf:col @ 2020/09/20-22:24:43.835000
    "\x00\x00\x00\x00\x00\x003\xd1"
----------------------------------------
test_num1
  cf:col @ 2020/09/20-22:24:43.835000
    "\x00\x00\x00\x00\x00\x001t"
whereas the string ones look like this (each row has two cells since I reused the same row keys, but you get the point):
test_num0
  cf:col @ 2020/09/20-22:36:12.075000
    "11510"
  cf:col @ 2020/09/20-22:24:43.835000
    "\x00\x00\x00\x00\x00\x003\xd1"
----------------------------------------
test_num1
  cf:col @ 2020/09/20-22:36:12.075000
    "12048"
  cf:col @ 2020/09/20-22:24:43.835000
    "\x00\x00\x00\x00\x00\x001t"
This is similar to what you were trying to do in the first example. For cells that have been overwritten, you're looking for filters around cell timestamps (TimestampRangeFilter, sketched below) or the number of cells per column (CellsColumnLimitFilter again).
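Here's a minimal sketch of the timestamp approach (the one-hour window is just a placeholder; adjust it to the write times you care about):

import datetime

from google.cloud.bigtable import row_filters

# Keep only cells whose timestamps fall inside this (hypothetical) window
end = datetime.datetime.utcnow()
start = end - datetime.timedelta(hours=1)
rows = table.read_rows(
    filter_=row_filters.TimestampRangeFilter(
        row_filters.TimestampRange(start=start, end=end)))
for row in rows:
    print(row)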