Tags: javascript, mapreduce, riak

How to tokenize once, reusing token in riak key filter


Using Bitcask with Riak, I have well-defined key names that I'm filtering in map-reduce queries using key filters. This is meant to be an experiment in using key filters with Bitcask to achieve 2i functionality (and then to compare my application's performance using secondary indexes vs. using key filters).
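
For the secondary-index half of that comparison, I'd be benchmarking against an ordinary 2i lookup, roughly like the sketch below (assuming a hypothetical type_bin index carrying the same demo.type.* value for each object; 2i needs a backend such as LevelDB rather than Bitcask):

# hypothetical 2i lookup used as the baseline for the comparison,
# assuming each object was also written with a type_bin secondary index
curl localhost:8098/buckets/my_example_bucket/index/type_bin/demo.type.1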

Riak key filter documentation

Given a bucket containing keys whose names are formatted like version_type_user_timestamp, I end up with keys that look like the following.

GET /riak/my_example_bucket?keys=stream HTTP/1.1
Host: localhost
Accept: application/json

{
    "keys": [
        "v0.3_demo.type.1_user12345_1375315200000",
        "v0.3_demo.type.1_user10000_1375315200973",
        "v0.3_demo.type.4_user00288_1375315101004",
        ...
    ]
}
{
    "keys": [
        "v0.3_demo.type.2_user12777_1375315211000",
        "v0.3_demo.type.1_user12777_1375315211782",
        "v0.3_demo.type.2_user50121_1375315101004",
        ...
    ]
}
...

I'm constructing key filters that look like the following. The idea is to do fewer value lookups by filtering results by key beforehand.

{
    "bucket": "my_example_bucket",
    "key_filters": [
        [
            "or",
            [
                [
                    "tokenize",
                    "_",
                    2
                ],
                [
                    "eq",
                    "demo.type.1"
                ]
            ],
            [
                [
                    "or",
                    [
                        [
                            "tokenize",
                            "_",
                            2
                        ],
                        [
                            "eq",
                            "demo.type.2"
                        ]
                    ],
                    [
                        [
                            "tokenize",
                            "_",
                            2
                        ],
                        [
                            "eq",
                            "demo.type.3"
                        ]
                    ]
                ]
            ]
        ]
    ]
}

This technique works, but notice how it tokenizes the key in every ["or", [...], [...]] clause. My hypothesis is that if I can tokenize once and feed the result into a pipeline of or clauses that all test for accepted variations of the token, then the key-filter stage of the map-reduce query will do less work (and therefore the filtering part of the query will take less time).

I've tried formatting requests like the following, but this doesn't seem to work.

{
    "bucket": "my_example_bucket",
    "key_filters": [
        [
            "tokenize",
            "_",
            2
        ],
        [
            "or",
            [
                "eq",
                "demo.type.1"
            ],
            [
                "or",
                [
                    "eq",
                    "demo.type.2"
                ],
                [
                    "eq",
                    "demo.type.3"
                ]
            ]
        ]
    ]
}

Is there a way to do this without re-tokenizing in every or clause?


Solution

  • You can probably get the functionality you want using the matches filter and a regular expression. I generated some fake data with

    # Text holds the four possible middle tokens for the generated keys
    Text=(AB DE FF RB)
    for i in {1..100}; do curl localhost:8098/buckets/BUCKET/keys/${RANDOM}_${Text[$(($RANDOM / 8196))]}_$i -H "content-type: text/plain" -XPUT -d "$i - $RANDOM"; done
    

    so BUCKET contains 100 keys of the form
    <random number from 0 to 32767>_(AB|DE|FF|RB)_<number from 1 to 100>

    Then I constructed a query using a regular expression to pull all of the keys where the random number started with 1, 2, or 8, the string in the middle was "DE", and the trailing index started with a digit from 1 through 8 (i.e. 1-8 or 10-89), like so:

    {
        "inputs": {
            "bucket": "BUCKET",
            "key_filters": [
                ["matches", "^[128][^_]*_DE_[1-8].?$"]
            ]
        },
        "query": [{
            "map": {
                "language": "javascript",
                "source": "function(rObject){return [rObject.key]}"
            }
        }]
    }
    

    which returned

    ["8461_DE_69","11823_DE_34","21302_DE_83","17568_DE_6","10066_DE_22",
     "1973_DE_68","15742_DE_54","8027_DE_29","25593_DE_50",
     "15301_DE_43","21039_DE_63","24454_DE_39","10350_DE_42","17432_DE_11",
     "15588_DE_2","16895_DE_80","28046_DE_18","14872_DE_75"]
    

    If you can express what you are trying to match in a regular expression, this may work for you with fewer steps.

    For your example, you might use something like:

    [["matches","^[^_]*_demo[.]type[.][123]_.*$"]]