c# azure-cosmosdb azure-cosmosdb-sqlapi

How do I get UNIQUE categories from all documents in CosmosDB?


I have millions of documents in CosmosDB using the SQL API, and I need to find the unique categories across all of them.

The documents look like the one below; the categories array sits just under the description, inside each entry of locales. The order doesn't matter: I just need every unique category across all documents in the collection, so that I know what the possible options are before I write queries against them later (that's a separate question). I can't figure out a query that returns only the category names.

{
    "id": "56d934d3-90bf-4f5a-b602-e515fefa599f",
    "_id": "5bf6705f9568cf00013cd13c",
    "vendor": "XXX",
    "updatedAt": "2018-11-23T03:55:30.044Z",
    "locales": [
        {
            "title": "Cold shoulder t-shirt",
            "description": "Because collar bones. Trending cold shoulder t-shirt in 100% organic cotton. Classic, wide and boxy t-shirt fit with cut-out details. In black, because black tees and fashion are like this (insert friendly hand gesture). This style is online exclusive.",
            "categories": [
                "Women",
                "clothing",
                "tops"
            ],
            "brand": null,
            "images": [
                "https://lp.xxx.com/app002prod?set=source[01_0659881_001_102],type[ECOMLOOK],device[hdpi],quality[80],ImageVersion[2018081]&call=url[file:/product/main]",
                "https://lp.xxx.com/app002prod?set=source[01_0659881_001_203],type[ECOMLOOK],device[hdpi],quality[80],ImageVersion[2018081]&call=url[file:/product/main]",
                "https://lp.xxx.com/app002prod?set=source[01_0659881_001_301],type[ECOMLOOK],device[hdpi],quality[80],ImageVersion[2018081]&call=url[file:/product/main]",
                "https://lp.xxx.com/app002prod?set=source[02_0659881_001_101],type[PRODUCT],device[hdpi],quality[80],ImageVersion[1.0]&call=url[file:/product/main]"
            ],
            "country": "SE",
            "currency": "SEK",
            "language": "en",
            "variants": [
                {
                    "artno": "0659881001",
                    "urls": [
                        "https://click.linksynergy.com/link?id=INtcw3sexSw&offerid=491018&type=2&murl=https%3A%2F%2Fwww.xxx.com%2Fen_sek%2Fclothing%2Ftops%2Fproduct.cold-shoulder-t-shirt-black-magic.0659881001.html"
                    ],
                    "price": 80,
                    "stock": 0,
                    "attributes": {
                        "size": "XXS",
                        "color": "Black magic"
                    }
                },
                {
                    "artno": "xxx",
                    "urls": [
                        "https://click.linksynergy.com/link?id=INtcw3sexSw&offerid=491018&type=2&murl=https%3A%2F%2Fwww.xxx.com%2Fen_sek%2Fclothing%2Ftops%2Fproduct.cold-shoulder-t-shirt-black-magic.0659881001.html"
                    ],
                    "price": 80,
                    "stock": 0,
                    "attributes": {
                        "size": "XS",
                        "color": "Black magic"
                    }
                },
                {
                    "artno": "0659881001",
                    "urls": [
                        "https://click.linksynergy.com/link?id=INtcw3sexSw&offerid=491018&type=2&murl=https%3A%2F%2Fwww.xxx.com%2Fen_sek%2Fclothing%2Ftops%2Fproduct.cold-shoulder-t-shirt-black-magic.0659881001.html"
                    ],
                    "price": 80,
                    "stock": 0,
                    "attributes": {
                        "size": "XL",
                        "color": "Black magic"
                    }
                },
                {
                    "artno": "0659881001",
                    "urls": [
                        "https://click.linksynergy.com/link?id=INtcw3sexSw&offerid=491018&type=2&murl=https%3A%2F%2Fwww.xxx.com%2Fen_sek%2Fclothing%2Ftops%2Fproduct.cold-shoulder-t-shirt-black-magic.0659881001.html"
                    ],
                    "price": 80,
                    "stock": 0,
                    "attributes": {
                        "size": "S",
                        "color": "Black magic"
                    }
                },
                {
                    "artno": "0659881001",
                    "urls": [
                        "https://click.linksynergy.com/link?id=INtcw3sexSw&offerid=491018&type=2&murl=https%3A%2F%2Fwww.xxx.com%2Fen_sek%2Fclothing%2Ftops%2Fproduct.cold-shoulder-t-shirt-black-magic.0659881001.html"
                    ],
                    "price": 80,
                    "stock": 1,
                    "attributes": {
                        "size": "M",
                        "color": "Black magic"
                    }
                },
                {
                    "artno": "0659881001",
                    "urls": [
                        "https://click.linksynergy.com/link?id=INtcw3sexSw&offerid=491018&type=2&murl=https%3A%2F%2Fwww.xxx.com%2Fen_sek%2Fclothing%2Ftops%2Fproduct.cold-shoulder-t-shirt-black-magic.0659881001.html"
                    ],
                    "price": 80,
                    "stock": 0,
                    "attributes": {
                        "size": "L",
                        "color": "Black magic"
                    }
                }
            ]
        }
    ],
    "_rid": "QEwcALNbIz8GAAAAAAAAAA==",
    "_self": "dbs/QEwcAA==/colls/QEwcALNbIz8=/docs/QEwcALNbIz8GAAAAAAAAAA==/",
    "_etag": "\"6a0003c6-0000-0000-0000-5bf7958c0000\"",
    "_attachments": "attachments/",
    "_ts": 1542952332
}

Solution

  • The query below returned all the unique category names in my test.

    Sample documents:

    [
        {
            "id": "1",
            "locales": [
                {
                    "categories": [
                        "Women",
                        "clothing",
                        "tops"
                    ]
                }
            ]
        },
        {
            "id": "2",
            "locales": [
                {
                    "categories": [
                        "Men",
                        "test",
                        "tops"
                    ]
                }
            ]
        }
    ]
    

    SQL:

    SELECT DISTINCT cat FROM c
    JOIN l IN c.locales
    JOIN cat IN l.categories
    

    Output:

    [
        {
            "cat": "Women"
        },
        {
            "cat": "clothing"
        },
        {
            "cat": "tops"
        },
        {
            "cat": "Men"
        },
        {
            "cat": "test"
        }
    ]
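
To see why the double JOIN produces one row per category, it can help to mimic the query in plain JavaScript. This is only a sketch run against the two sample documents, not how Cosmos DB evaluates the query internally; the helper name distinctCategories is made up for illustration, and the order of rows returned by the real service may differ.

```javascript
// Sample documents from the answer, reduced to the fields the query touches.
const docs = [
    { id: "1", locales: [{ categories: ["Women", "clothing", "tops"] }] },
    { id: "2", locales: [{ categories: ["Men", "test", "tops"] }] }
];

// Hypothetical helper mirroring:
//   SELECT DISTINCT cat FROM c JOIN l IN c.locales JOIN cat IN l.categories
function distinctCategories(documents) {
    const seen = new Set();
    const rows = [];
    for (const c of documents) {
        for (const l of c.locales || []) {          // JOIN l IN c.locales
            for (const cat of l.categories || []) { // JOIN cat IN l.categories
                if (!seen.has(cat)) {               // DISTINCT
                    seen.add(cat);
                    rows.push({ cat });
                }
            }
        }
    }
    return rows;
}

console.log(distinctCategories(docs));
```

Each JOIN behaves like a nested loop over an array inside the same document, so a document with several locales contributes the categories of every locale.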
    

    If the comparison should not be case sensitive, apply the LOWER function in the query so that, for example, "Women" and "women" count as one category.

    SELECT DISTINCT LOWER(cat) AS cat FROM c
    JOIN l IN c.locales
    JOIN cat IN l.categories
    

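    If you want a flat array such as ["Women","clothing","tops","Men","test"], note that the queries above return one object per category. Adding the VALUE keyword (SELECT DISTINCT VALUE cat ...) should return the bare strings directly; alternatively, you can reshape the output into an array in a stored procedure.

    For example, add the code below to the stored procedure: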

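        // `array` is assumed to hold the query results, e.g. [{ "cat": "Women" }, ...]
        var returnArray = [];
        for (var i = 0; i < array.length; i++) {
            // each result is an object keyed by the alias used in the SELECT ("cat")
            returnArray.push(array[i].cat);
        }
        return returnArray;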
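
With millions of documents, the results will probably arrive in several pages rather than in one array. The sketch below shows one way to merge such pages into a single array of unique names. It assumes each row looks like { cat: "Women" }, matching the alias in the SELECT; the helper name mergeCategoryPages and the pages variable are hypothetical and not part of any Cosmos DB API.

```javascript
// Hypothetical helper: collapse pages of query rows into one array of unique names.
// Each row is assumed to look like { cat: "Women" }.
function mergeCategoryPages(pages) {
    const seen = new Set();
    for (const page of pages) {
        for (const row of page) {
            seen.add(row.cat);
        }
    }
    // A Set keeps insertion order, so names come out in first-seen order.
    return Array.from(seen);
}

// Two made-up pages; "tops" appears in both and is kept only once.
const pages = [
    [{ cat: "Women" }, { cat: "clothing" }, { cat: "tops" }],
    [{ cat: "Men" }, { cat: "test" }, { cat: "tops" }]
];
console.log(mergeCategoryPages(pages));
```

Using a Set also guards against the same name showing up in more than one page, which a simple push loop would not catch.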