Search code examples
stringmongodbmapreduceaggregation-frameworkmongodb-aggregation

Finding most commonly used word in a string field throughout a collection


Let's say I have a Mongo collection similar to the following:

[
  { "foo": "bar baz boo" },
  { "foo": "bar baz" },
  { "foo": "boo baz" }
]

Is it possible to determine which words appear most often within the foo field (ideally with a count)?

For instance, I'd love a result set of something like:

[
  { "baz" : 3 },
  { "boo" : 2 },
  { "bar" : 2 }
]

Solution

  • There was recently closed a JIRA issue about a $split operator to be used in the $project stage of the aggregation framework.
    With that in place you could create a pipeline like this

    db.yourColl.aggregate([
        {
            $project: {
                words: { $split: ["$foo", " "] }
            }
        },
        {
            $unwind: {
                path: "$words"
            }
        },
        {
            $group: {
                _id: "$words",
                count: { $sum: 1 }
            }
        }
    ])
    

    result would look like so

    /* 1 */
    {
        "_id" : "baz",
        "count" : 3.0
    }
    
    /* 2 */
    {
        "_id" : "boo",
        "count" : 2.0
    }
    
    /* 3 */
    {
        "_id" : "bar",
        "count" : 2.0
    }