Service Fabric ImageStoreService fails to replicate

I have installed Service Fabric on three VMs running Windows Server 2016, with 5 nodes per VM (each node configured with a separate NodeType to avoid port conflicts, etc.). This is similar to running the OneBox Service Fabric cluster with 5 nodes on a dev machine.

All seems well during installation, and all services start correctly. The problem is that the ImageStoreService fails to complete its replication cycle: one of its 3 replicas (on nodes beta2, gamma4 & beta0 below) stays in In Build (IB) instead of completing.

The service itself reports:

Error event: SourceId='System.FM', Property='State'. Partition is below target replica or instance count.
ImageStoreService 3 3 00000000-0000-0000-0000-000000003000
  N/P RD beta2 Up 131372506454740092
  N/S IB gamma4 Up 131372506515241065
  N/S RD beta0 Up 131372506515241066
(Showing 3 out of 3 replicas. Total available replicas: 2.)
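For anyone comparing several of these reports, here is a rough Python sketch that pulls the replica entries out of the health message and lists the nodes whose replica is still In Build. It assumes each replica entry reads as previous role/current role, replica state (RD, IB, ...), node name, node status and replica id; that is my reading of the report, not a documented format. The names `Replica`, `parse_replicas` and `nodes_in_build` are made up for this example.

```python
import re
from typing import List, NamedTuple


class Replica(NamedTuple):
    previous_role: str  # N, P, S or I (assumed meaning: role in the previous configuration)
    current_role: str   # N, P, S or I (assumed meaning: role in the current configuration)
    state: str          # two-letter replica state, e.g. RD or IB
    node: str           # node name hosting the replica
    node_status: str    # Up or Down
    replica_id: str     # numeric replica id


# Assumed layout of one replica entry, e.g. "N/S IB gamma4 Up 131372506515241065".
REPLICA_PATTERN = re.compile(
    r"\b([NPSI])/([NPSI])\s+([A-Z]{2})\s+(\S+)\s+(Up|Down)\s+(\d+)"
)


def parse_replicas(report: str) -> List[Replica]:
    """Return every replica entry found in a health report message."""
    return [Replica(*m.groups()) for m in REPLICA_PATTERN.finditer(report)]


def nodes_in_build(report: str) -> List[str]:
    """Return the names of nodes whose replica is in the IB state."""
    return [r.node for r in parse_replicas(report) if r.state == "IB"]


report = (
    "Error event: SourceId='System.FM', Property='State'. "
    "Partition is below target replica or instance count. "
    "ImageStoreService 3 3 00000000-0000-0000-0000-000000003000 "
    "N/P RD beta2 Up 131372506454740092 "
    "N/S IB gamma4 Up 131372506515241065 "
    "N/S RD beta0 Up 131372506515241066 "
    "(Showing 3 out of 3 replicas. Total available replicas: 2.)"
)

print(nodes_in_build(report))
```

Run against the report above, this singles out gamma4 as the node whose replica never finishes building.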

I've made sure the shared folders created by each System Service are available and have a backing folder on disk (the uninstall process sometimes leaves orphans). I've disabled Windows Firewall on all three servers to rule out blocked traffic. I've also reinstalled Windows Server 2016 and Service Fabric on all three machines, and the problem remains.

Update: Based on the comments on this question, I created a new configuration and deployed it across 3 VMs (as before), but running only 1 node per VM.

Again the services start up fine, but the ImageStoreService reports:

Unhealthy event: SourceId='System.FM', Property='State', HealthState='Warning', ConsiderWarningAsError=false. Partition reconfiguration is taking longer than expected.
ImageStoreService 3 3 00000000-0000-0000-0000-000000003000
  P/P RD gamma Up 131376836149092409
  S/S IB alpha Up 131376836457801126
  S/S IB beta Up 131376836457801127
(Showing 3 out of 3 replicas. Total available replicas: 1.)

This Warning becomes an Error over time. It seems that as soon as replication for the ImageStore has to span VMs, it fails to complete.

Has anyone come across this before? Any suggestions as to what could make the replication fail? And where in the cluster installation is error information about replication events stored?

Solution

One machine should be one cluster node, not 5. More info here.

Each node in a standalone Service Fabric cluster has the Service Fabric runtime deployed and is a member of the cluster. In a typical production deployment, there is one node per OS instance (physical or virtual).
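To check an existing configuration against that guidance, something like the following Python sketch could count how many nodes are assigned to each machine. It assumes the cluster configuration lists its nodes under a `nodes` array with an `iPAddress` field per node, as the standalone sample configurations do. The node names, IP addresses and the helper names `nodes_per_machine` and `overloaded_machines` are hypothetical.

```python
from collections import Counter
from typing import Dict


def nodes_per_machine(cluster_config: dict) -> Counter:
    """Count nodes per machine, keyed by the node's IP address (assumed field name)."""
    return Counter(node["iPAddress"] for node in cluster_config.get("nodes", []))


def overloaded_machines(cluster_config: dict, limit: int = 1) -> Dict[str, int]:
    """Return machines that host more than `limit` nodes."""
    counts = nodes_per_machine(cluster_config)
    return {ip: n for ip, n in counts.items() if n > limit}


# Hypothetical configuration fragment: two nodes share the first machine.
example_config = {
    "nodes": [
        {"nodeName": "alpha0", "iPAddress": "10.0.0.4", "nodeTypeRef": "NodeType0"},
        {"nodeName": "alpha1", "iPAddress": "10.0.0.4", "nodeTypeRef": "NodeType1"},
        {"nodeName": "beta0", "iPAddress": "10.0.0.5", "nodeTypeRef": "NodeType0"},
        {"nodeName": "gamma0", "iPAddress": "10.0.0.6", "nodeTypeRef": "NodeType0"},
    ]
}

print(overloaded_machines(example_config))
```

In practice the dictionary would come from `json.load` on the real configuration file; an empty result means every machine hosts at most one node.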