
MongoDB Node.js driver hangs indefinitely after large update, beginning with the 6.4 driver


Working on a large project. Several dozen collections, 1 million users a day, 10,000+ db transactions a second.

We just had to roll back a Mongoose update, going from 8.4.5 back to 8.2.0. The rollback also downgrades a number of other packages, including mongodb, bson, etc.

The behavior was badly breaking, yet I cannot find any other reports of a similar problem, so I'm looking for hints as to the source.

While I suspect this findOne hang was happening in multiple places, there is one definitive place because we track how many player lookups we handle at the same time. It was hanging here, and never hitting the finally. No exceptions, just hanging. Not every time, either. It runs for ~1 hour before it starts to hang (so I suspect connection pool exceeded or connections being recycled).

let player: DBPlayer | null;
try {
  this.precheck++;
  player = await models.Player.findOne({
    _id: playerId, // playerId is a string representation of an ObjectId
  })
    .select({
      _id: 1, // not necessary, but for clarity
      authId: 1, // the only field we need
    })
    .lean(); // lean does a different code path in mongoose, so might be relevant
} finally {
  this.precheck--;
}

After about an hour with 8.4.5, we see our precheck number creep up: it increases rapidly, seems to stabilize, then shoots up again in bursts. What happens is that our users connect, and either the connection times out or they wait forever, so they refresh and try again. Each time, a new promise is created here and just waits forever. The number never goes down, even after more than 30 minutes.
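While investigating a hang like this, it can help to bound each lookup so the finally block is always reached and the counter stays meaningful. The following is a minimal sketch, not something from Mongoose or the driver: withTimeout and guardedLookup are hypothetical helper names, and the lookup is reduced to a function returning a Promise<string>.

```typescript
// Hypothetical helper: rejects if `work` has not settled within `ms` milliseconds.
// It does NOT cancel the underlying operation; it only unblocks the caller.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms} ms`)),
      ms,
    );
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}

// Mirrors the in-flight counter from the question (assumed to be a plain number).
let precheck = 0;

// Hypothetical wrapper: counts in-flight lookups and bounds how long each may take.
async function guardedLookup(lookup: () => Promise<string>, ms: number): Promise<string> {
  precheck++;
  try {
    return await withTimeout(lookup(), ms, 'player lookup');
  } finally {
    precheck--; // runs even when the lookup itself never settles
  }
}
```

With Mongoose, the query would presumably be passed in as a real promise, for example by calling .exec() on it. A timeout like this only hides the symptom from callers: the stuck operation underneath is still stuck, so it is a diagnostic aid rather than a fix.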

models.Player is created like this:

models.Player = model<DBPlayer>('Player', schema)

The schema shouldn't be relevant, but it is fairly complex, with maps, arrays, hierarchical objects, etc. Everything works fine with 8.2.0.

I suspected this had to do with buffering when a new connection is created, but I couldn't find the code in the mongoose codebase that has changed between these versions. The documentation suggests the buffering is just before the initial connection, which is definitely successful as we run for an hour normally.

An hour may be a default interval for recycling connections, in which case the recycling really isn't working.


Solution

  • 6+ months after the driver problem was introduced, it seemed nobody was going to do anything about it. I had to learn the mongodb nodejs driver code and find it myself.

    It boils down to a flaw in their socket code, which did not start listening to the incoming data soon enough after sending data on the socket. It would work most of the time, but when there were large amounts of data being sent and the system was under load with a lot of async tasks to process, the 'data' event would come in before the driver was listening to it.

    I submitted a PR to resolve it.

    https://github.com/mongodb/node-mongodb-native/pull/4245
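The kind of race described above can be illustrated with a plain EventEmitter, which does not buffer events: anything emitted before a listener is attached is simply lost. This is a simplified sketch and not the driver's code. fakeSocket, waitForData, lateListener and earlyListener are hypothetical names, and the timeout exists only so the demonstration ends instead of hanging. The actual code path is in the linked PR.

```typescript
import { EventEmitter } from 'node:events';

// Waits for one 'data' event, or gives up after `timeoutMs`.
// The promise executor runs synchronously, so the listener is attached
// as soon as waitForData is called.
function waitForData(source: EventEmitter, timeoutMs: number): Promise<string> {
  return new Promise((resolve) => {
    const timer = setTimeout(() => resolve('timed out'), timeoutMs);
    source.once('data', (chunk: string) => {
      clearTimeout(timer);
      resolve(`received ${chunk}`);
    });
  });
}

// The reply is emitted before anyone listens, so it is lost.
async function lateListener(): Promise<string> {
  const fakeSocket = new EventEmitter(); // stand-in for a socket
  fakeSocket.emit('data', 'reply');
  return waitForData(fakeSocket, 50);
}

// The listener is attached first, so the reply is delivered.
async function earlyListener(): Promise<string> {
  const fakeSocket = new EventEmitter();
  const pending = waitForData(fakeSocket, 50);
  fakeSocket.emit('data', 'reply');
  return pending;
}
```

Without the timeout, lateListener would wait forever with no exception, which matches the symptom in the question: a promise that never settles and a finally block that is never reached.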