I had been using Data Factory's integration runtime with VNet successfully, but it recently stopped connecting to Cosmos DB with the MongoDB API (which is also within a VNet). After setting up a new integration runtime with VNet enabled and selecting the region as 'Auto Resolve,' the pipeline ran successfully with this new runtime.
Could you help me understand why the previous integration runtime (configured with VNet enabled and the region set to match that of Azure Data Factory) worked for over a month but then suddenly failed? The new integration runtime with VNet and 'Auto Resolve' region worked, but I'm uncertain whether the 'Auto Resolve' region contributed to the success or something else allowed it to connect.

Error:

Failure happened on 'Source' side. ErrorCode=MongoDbConnectionTimeout,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=>Connection to MongoDB server is timeout.,Source=Microsoft.DataTransfer.Runtime.MongoDbAtlasConnector,''Type=System.TimeoutException,Message=A timeout occured after 30000ms selecting a server using CompositeServerSelector{ Selectors = MongoDB.Driver.MongoClient+AreSessionsSupportedServerSelector, LatencyLimitingServerSelector{ AllowedLatencyRange = 00:00:00.0150000 } }. Client view of cluster state is { ClusterId : "1", ConnectionMode : "ReplicaSet", Type : "ReplicaSet", State : "Disconnected", Servers : [{ ServerId: "{ ClusterId : 1, EndPoint : "Unspecified/cosmontiv01u.mongo.cosmos.azure.com:10255" }", EndPoint:
Let me break down what might have happened here. This is an interesting infrastructure issue that touches on several Azure networking concepts. Here are a few potential reasons why your original setup stopped working.
Network Configuration Changes:
- VNet peering settings might have been modified
- Network Security Group (NSG) rules could have been updated
- Subnet configurations might have changed
- Service endpoint or private endpoint settings might have been altered (the DNS/connectivity sketch after this list can help rule this in or out)
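If you can run a script from a VM inside the VNet that hosts the Cosmos DB private endpoint, a minimal sketch like the one below (the hostname is taken from your error message; adjust it if your account name differs) shows whether the endpoint still resolves to a private IP and whether port 10255 is reachable at all:

```python
# Diagnostic sketch: run from a VM inside the VNet that hosts the Cosmos DB
# private endpoint. Hostname below comes from the error message in the question.
import ipaddress
import socket

host = "cosmontiv01u.mongo.cosmos.azure.com"
port = 10255

# Resolve the endpoint and report whether it maps to a private or public address.
for info in socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP):
    ip = info[4][0]
    kind = "private" if ipaddress.ip_address(ip).is_private else "public"
    print(f"{host} -> {ip} ({kind})")

# Try a raw TCP connection; a timeout here points at NSG, peering, or firewall
# changes rather than anything specific to Data Factory.
try:
    with socket.create_connection((host, port), timeout=10):
        print("TCP connection succeeded")
except OSError as exc:
    print(f"TCP connection failed: {exc}")
```

If the name resolves to a public IP, the private endpoint DNS link has most likely changed; if it resolves privately but the TCP connect times out, NSGs, peering, or route tables are the more likely suspects.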
Integration Runtime Issues:
- The manually selected region might have experienced capacity issues
- The runtime's network configuration might have become stale
- There could have been an IR version update that affected the networking stack (the runtime's current state can be checked with the SDK sketch after this list)
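To confirm whether the runtime itself changed state, a sketch with the Azure SDK for Python (all names below are placeholders; requires the azure-identity and azure-mgmt-datafactory packages) can pull its definition and current status:

```python
# Sketch: inspect an integration runtime's definition and current status.
# All resource names are placeholders for your environment.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
factory_name = "<data-factory-name>"
ir_name = "<integration-runtime-name>"

client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Fetch the runtime definition (type, compute, managed VNet settings) and its
# live status as reported by the service.
ir = client.integration_runtimes.get(resource_group, factory_name, ir_name)
status = client.integration_runtimes.get_status(resource_group, factory_name, ir_name)

print(ir.properties)
print(status.properties)
```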
Cosmos DB Changes:
- Firewall rules might have been modified
- Private endpoint configurations could have changed
- Network access settings might have been updated (a standalone driver test, sketched after this list, helps confirm whether the problem is the account itself or the runtime)
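To take Data Factory out of the picture, it can help to reproduce the connection with the plain MongoDB driver. Here is a sketch, assuming pymongo is installed and the connection string is copied from the Cosmos DB account's connection strings blade, using the same 30-second server-selection window that appears in your error:

```python
# Standalone connectivity check against the Cosmos DB MongoDB API endpoint.
# The connection string is a placeholder; copy it from the Azure portal.
from pymongo import MongoClient
from pymongo.errors import ServerSelectionTimeoutError

connection_string = "<cosmos-db-mongodb-connection-string>"

# Match the 30000 ms server-selection timeout seen in the ADF error.
client = MongoClient(connection_string, serverSelectionTimeoutMS=30000)

try:
    client.admin.command("ping")
    print("Connected, server version:", client.server_info()["version"])
except ServerSelectionTimeoutError as exc:
    # Same class of failure as the pipeline error: no reachable server in time.
    print("Server selection timed out:", exc)
```

If this succeeds from a machine with network access to the account but the pipeline still fails, the issue is more likely on the integration runtime or its network path than on the Cosmos DB side.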
The success with the 'Auto Resolve' region suggests that setting is more resilient because:

- It can dynamically choose the optimal region based on network conditions
- It can fail over to a different region if there are connectivity issues
- It might use a different networking path to reach Cosmos DB