Let's say I have a Mongo collection similar to the following:
[
{ "foo": "bar baz boo" },
{ "foo": "bar baz" },
{ "foo": "boo baz" }
]
Is it possible to determine which words appear most often within the foo
field (ideally with a count)?
For instance, I'd love a result set of something like:
[
{ "baz" : 3 },
{ "boo" : 2 },
{ "bar" : 2 }
]
There was recently closed a JIRA issue about a $split
operator to be used in the $project
stage of the aggregation framework.
With that in place you could create a pipeline like this
db.yourColl.aggregate([
{
$project: {
words: { $split: ["$foo", " "] }
}
},
{
$unwind: {
path: "$words"
}
},
{
$group: {
_id: "$words",
count: { $sum: 1 }
}
}
])
result would look like so
/* 1 */
{
"_id" : "baz",
"count" : 3.0
}
/* 2 */
{
"_id" : "boo",
"count" : 2.0
}
/* 3 */
{
"_id" : "bar",
"count" : 2.0
}