Search code examples
pythongoogle-cloud-bigtablehappybase

Scanning Bigtable by prefix with Python SDKs


I'm trying to perform searches by multiple prefixes at Google Cloud Bigtable with the Python SDK. I'm using read_rows, and I can't see a good way to search by prefix explicitly.

My first option is RowSet + RowRange. I'm testing three queries, and the times that I'm getting are ~1.5s, ~3.5s and ~4.2s, which are an order of magnitude slower than the searches with the Node SDK (which has a filter option) ~0.19, ~0.13, ~0.46.

The second option is using RowFilterChain + RowKeyRegexFilter. Performance is terrible for two of the queries: ~3.1s, ~70s, ~75s ~0.124s, ~72s, ~69s. It looks like it's doing a full scan. This is the code section:

            regex = f'^{prefix}.*'.encode()
            filters.append(RowKeyRegexFilter(regex)) 

My third option is using the alternative Happybase-based SDK, which has prefix filtering. With that, I'm getting ~36s, ~3s, ~1s ~0.4, ~0.1, ~0.17. The first query involves multiple prefixes, and it doesn't seem to have support for multiple filtering in the same request, so I'm performing as many requests as prefixes and then concatenating the iterators. The other two seem to leverage the prefix filter.

UPDATE: I deleted the first times because there was a mistake with the environment. After doing it properly, times are not bad for range query, but it seems to be room for improvement, as Happybase tests are still faster when they leverage prefix search.

Would appreciate help about using multiple prefix searches in Happybase, or actual prefix search in the main Python SDK.


Solution

  • The read_rows method have two parameters start_key and end_key that you can use to filter efficiently rows based on the row key (see docs). Behind the scenes, this method performs a Scan, so that's why this is probably the most efficient way to filter rows based on their row keys.

    For example, let's suppose you have the following row keys in your table :

    a
    aa
    b
    bb
    bbb
    

    and you want to retrieve all rows with a row key prefixed by a, you can run :

    rows_with_prefix_a = my_table.read_rows(start_key="a", end_key="b")
    

    This will only scan rows between a and b (b excluded), so this will return all rows with row key prefix a (a and aa in the previous example).