Search code examples
google-cloud-bigtablebigtable

BigTable: Is there a better approach to get unique values from partial row keys?


I've made a bigtable with row key in <name>#<date>#<id_value>

and I'd like to get unique ids when I filter using row key prefix like below.

client = bigtable.Client(project=project_id, admin=True)
instance = client.instance(instance_id)
table = instance.table(table_id)
prefix = "phone#20190501"
end_key = prefix[:-1] + chr(ord(prefix[-1]) + 1)

# example row keys = ['phone#20190501#<id_value>', 'phone#20190501#<id_value>'...]

row_set = RowSet()
row_set.add_row_range_from_keys(prefix.encode("utf-8"),
                                end_key.encode("utf-8"))

rows = table.read_rows(row_set=row_set)
id_values = []
for row in rows:
    # get last id_value from row key
    id_value = str(row.key).replace('phone#20190501#', '')
    id_values.append(id_value)
unique_id_list = list(set(id_values))
print('COUNT: %s' % len(unique_id_list))

However, I'm wondering that if I read rows over 100 million, I think this way to calculate unique id_value might eats a lot of memory and cpu.

Is there a better way to count unique ids in Bigtable or function like "UNIQUE" in standard SQL


Solution

  • Bigtable doesn't have a way to sort/unique like SQL, it has to be done on the client side - through code. However, there are some performance considerations that can help you. You can find it in the below stack thread posted by DamPlz:

    1. In case query speed is a must, loading the data into BigQuery instead of setting up an external data source would be the most efficient way. Nevertheless, there are some things you can do to improve BigQuery, or BigTable performances.

    2. This connector is still in the Beta stage, and has some performance considerations. We should also take into consideration that BigTable is a noSQL (non relational) database and is not intended for SQL queries. In case you are exploring the data model you want to use in your application, I recommend you consider all these options and choose the one that fits better with your needs.

    3. I would say it is not a good choice if you want to query your data using SQL. Understanding de non relational architecture of BigTable, the most effective way to read your data would be sending read requests. You can find some code samples about this, in different languages in the official documentation.