google-cloud-bigtable

Regex filters on Bigtable queries - performance/recommendations


I'd like to ask about recommendations and performance considerations for using regex queries in Bigtable, with or without a prefix.

We have information at the end of the row key that we need to filter with Regex.

Would Bigtable need to do a full table scan to perform a Regex query that does not include a prefix? What are the performance considerations? Is this recommended?

How would adding a prefix to the query change the recommendation?

Would appreciate advice/thoughts on this as we optimize our schema.


Solution

  • I'm on the eng team for Cloud Bigtable.

    The Bigtable filter engine will attempt to parse out any prefix present in your Regex query and use it to reduce the scope of the scan.

    Edit (2019-05-14): Turns out this isn't quite accurate. Bigtable will parse out the prefix and use it to seek past irrelevant data, but this happens separately for each tablet. And in particular, we still have to send a request to each tablet even if the tablet ends up getting entirely skipped. So this will be much faster than a true full table scan, but will still have performance issues. We are looking into improvements.

    However, if you don't provide a prefix then Bigtable has nothing to go on, since any row could potentially match. As such, this type of query will result in a full table scan. Large scans are not recommended for queries which need to perform well, so it's best to arrange your row key to avoid them as much as possible. You can find more information about schema design in the docs.
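    To make the difference concrete, here is a small in-memory model of a sorted key space. It is only an illustration, not Bigtable's actual implementation: the helper names `prefix_end` and `scan` and the `user###{id}#{kind}` key layout are assumptions made up for this sketch. It counts how many keys have to be examined when the regex has a literal prefix that can be turned into a key range, versus when it does not.

```python
import bisect
import re


def prefix_end(prefix: bytes) -> bytes:
    """Smallest key that sorts after every key starting with `prefix`.

    Returns b"" when there is no upper bound (empty or all-0xff prefix).
    """
    trimmed = prefix.rstrip(b"\xff")
    if not trimmed:
        return b""
    return trimmed[:-1] + bytes([trimmed[-1] + 1])


def scan(sorted_keys, pattern: bytes, prefix: bytes = b""):
    """Return (matching keys, number of keys examined).

    With a prefix we can seek straight to the relevant slice of the sorted
    keys; without one, every key is a candidate.
    """
    regex = re.compile(pattern)
    lo = bisect.bisect_left(sorted_keys, prefix)
    end = prefix_end(prefix)
    hi = bisect.bisect_left(sorted_keys, end) if end else len(sorted_keys)
    candidates = sorted_keys[lo:hi]
    matches = [key for key in candidates if regex.fullmatch(key)]
    return matches, len(candidates)


# Hypothetical row keys: the field we filter on sits at the END of the key.
keys = sorted(
    b"user%03d#%s" % (i, kind)
    for i in range(100)
    for kind in (b"click", b"error")
)

# No literal prefix: all 200 keys must be examined.
all_errors, examined_all = scan(keys, rb".*#error")

# Literal prefix "user042#": only the 2 keys in that range are examined.
one_user, examined_prefix = scan(keys, rb"user042#.*error", prefix=b"user042#")
```

    In this model the unprefixed query touches every key, while the prefixed one touches only the keys inside the prefix range. Keep in mind the caveat from the edit above: in the real service the seek happens per tablet, so the saving is smaller than this toy model suggests.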

    Note that you can always set explicit row bounds on your scans in all of our supported clients. This is useful for limiting the size of an otherwise unbounded scan, but you can also use it to read multiple shards of the table in parallel if you truly need to accelerate a large query:
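    A minimal sketch of that idea with the Python client (`google-cloud-bigtable`), under a few assumptions: `table` is an already-constructed `Table` object, the split points in `split_keys` are chosen by you (for example from `table.sample_row_keys()`), and the helper names `shard_ranges`, `read_shard`, and `parallel_scan` are made up for this example. Each worker reads one key range with explicit `start_key`/`end_key` bounds and applies the regex filter within it.

```python
from concurrent.futures import ThreadPoolExecutor


def shard_ranges(split_keys):
    """Turn split points into contiguous (start, end) key ranges.

    None means "unbounded", so the ranges together cover the whole table.
    Empty keys are dropped because they cannot act as a split point.
    """
    bounds = [None] + sorted(k for k in split_keys if k) + [None]
    return list(zip(bounds[:-1], bounds[1:]))


def read_shard(table, start_key, end_key, regex):
    """Read one key range, keeping only rows whose key matches `regex`."""
    # Imported lazily so the pure helpers above work without the client.
    from google.cloud.bigtable import row_filters

    rows = table.read_rows(
        start_key=start_key,  # inclusive lower bound
        end_key=end_key,      # exclusive upper bound by default
        filter_=row_filters.RowKeyRegexFilter(regex),
    )
    return [row.row_key for row in rows]


def parallel_scan(table, split_keys, regex, max_workers=8):
    """Scan all shards concurrently and concatenate the matching row keys."""
    ranges = shard_ranges(split_keys)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        chunks = pool.map(
            lambda bounds: read_shard(table, bounds[0], bounds[1], regex),
            ranges,
        )
        return [key for chunk in chunks for key in chunk]
```

    A call might look like `parallel_scan(table, [b"user250", b"user500", b"user750"], rb".*#error")`. Because the ranges are disjoint and `end_key` is exclusive, no row is read twice. This speeds up a large scan by spreading it out; it does not reduce the total amount of data the table has to read.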