Search code examples
mongodbmongodb-queryaggregation-frameworkgraphlookup

MongoDB aggregation $graphLookup - find "connections" in common in collections of following relationships


I have a collection of "who is following who" (like Instagram):

db.users.insertMany([
  { _id: 1, name: "Arnold Schwarzenegger" },
  { _id: 2, name: "James Earl Jones" },
  { _id: 3, name: "Harrison Ford" },
  { _id: 4, name: "Jennifer Lawrence" }
]);

db.follows.insertMany([
  { _id: 12, follower: 1, following: 2 },
  { _id: 13, follower: 1, following: 3 },
  { _id: 24, follower: 2, following: 4 },
  { _id: 23, follower: 2, following: 3 }
]);

I'm trying to suggest other users whom others they can follow. i.e. which other people they could be following; suggested followers, ordered by the number of existing common connections.

In this example:

+--------+--------------+----------+
|   A    | is following |    B     |
+--------+--------------+----------+
| Arnold | ->           | James    |
| Arnold | ->           | Harrison |
| James  | ->           | Jennifer |
| James  | ->           | Harrison |
+--------+--------------+----------+

Between Arnold and James, who can Arnold follow? (excluding existing connections)

The answer should be: Jennifer

This is poor attempt:

db.users.aggregate([
  {
    $match: { _id: 1 } // Arnold
  },
  {
    $graphLookup: {
      from: "follows",
      startWith: "$_id",
      connectFromField: "following",
      connectToField: "follower",
      maxDepth: 1,
      as: "connections",
    }
  }
]);

Which results in:

  {
    "_id": 1,
    "name": "Arnold Schwarzenegger",
    "connections": [
      {
        "_id": 24,
        "follower": 2,
        "following": 4
      },
      {
        "_id": 13,
        "follower": 1,
        "following": 3
      },
      {
        "_id": 23,
        "follower": 2,
        "following": 3
      },
      {
        "_id": 12,
        "follower": 1,
        "following": 2
      }
    ]
  }

I believe I need to do some $unwind'ing, but I'm kind of stuck now


Solution

  • Here are two possible approaches. (I haven't tested with larger data sets so your mileage may vary!)

    The first one builds on your $graphLookup stage:

    db.users.aggregate([
      { $match: { _id: 1 }},
      { $graphLookup: {
        from: 'follows',
        startWith: '$_id',
        connectFromField: 'following',
        connectToField: 'follower',
        maxDepth: 1,
        as: 'connections'
      }},
      { $unwind: { path: '$connections' }},
      { $group: {
        _id: '$connections.follower',
        follows: {
          $addToSet: '$connections.following'
        }
      }},
      { $unwind: { path: '$follows' }},
      { $group: {
        _id: '$follows',
        isFollowedBy: {
          $addToSet: '$_id'
        }
      }},
      { $match: { isFollowedBy: { $not: { $in: [1] }} }},
      { $group: {
        _id: null,
        newConnections: {
          $addToSet: '$_id'
        }
      }},
      { $project: { _id: 0 }}
    ])
    

    Note that this pipeline ends up building the relationships from the other collection mid-way so another approach is to start with the other collection as follows:

    db.follows.aggregate([
      { $lookup: {
        from: 'follows',
        localField: 'following',
        foreignField: 'follower',
        as: 'potentialSet'
      }},
      { $unwind: {
        path: "$potentialSet",
        preserveNullAndEmptyArrays: true
      }},
      { $group: {
        _id: "$follower",
        "alreadyFollowing": {
          $addToSet: "$following"
        },
        "potentialConnections": {
          "$addToSet": "$potentialSet.following"
        }
      }},
      { $project: {
        newConnections: { $setDifference: [ "$potentialConnections", "$alreadyFollowing" ] }
      }},
      { $match: { _id: 1 }},
      { $project: { _id: 0 }}
    ])
    

    If it helps I used MongoDB Compass Community Edition to help build these pipelines. It's pretty cool in that it allows you to quickly iterate and see the output of each stage which really helps when you're trying to debug the pipeline.