Windows Workflow 4 Correlation Query includes website instance name in instance key calculation and fails


I am trying to host a long running workflow service on Azure but I am having problems with correlation.
I have set timeToUnload and timeToPersist to 0 and ticked "persist before send" in the workflow, so this is not a problem with persistence; it is to do with how instance keys are calculated.

When one web server starts a workflow and another then tries to take a further action on that workflow, it fails with:

System.ServiceModel.FaultException: The execution of an InstancePersistenceCommand was interrupted because the instance key '12e0b449-7a71-812d-977a-ab89864a272f' was not associated to an instance. This can occur because the instance or key has been cleaned up, or because the key is invalid. The key may be invalid if the message it was generated from was sent at the wrong time or contained incorrect correlation data.

I used WCF service diagnostics to dig into this and found that the calculation of the instance key includes the website instance name. A given workflow instance can therefore only be called back from the same machine that instantiated it, because Azure sets a different website instance name on each role instance.

To explain: when I create a new instance of the workflow, I have an activity that gets the workflow instance Guid, returns that Guid, and also uses the correlation initializer to set the correlation handle.

I have enabled service tracing in web.config, so in the Service Trace Viewer I can see the following when I create a new instance of the workflow:

<ApplicationData>
    <TraceData>
        <DataItem>
            <TraceRecord Severity="Information" Channel="Analytic" xmlns="http://schemas.microsoft.com/2004/10/E2ETraceEvent/TraceRecord">
                <TraceIdentifier>225</TraceIdentifier>
                <Description>Calculated correlation key '496e3207-fe9d-919f-b1df-f329c5a64934' using values 'key1:10013d62-286e-4a8f-aeb2-70582591cd7f,' in parent scope '{/NewOrbit.ExVerifier.Web_IN_2_Web/Workflow/Application/}Application_default1.xamlx'.</Description>
                <AppDomain>/LM/W3SVC/1273337584/ROOT-1-129811251826070757</AppDomain>
            </TraceRecord>
        </DataItem>
    </TraceData>
</ApplicationData>

The important line is this:

Calculated correlation key '496e3207-fe9d-919f-b1df-f329c5a64934' using values 'key1:10013d62-286e-4a8f-aeb2-70582591cd7f,' in parent scope '{/NewOrbit.ExVerifier.Web_IN_2_Web/Workflow/Application/}Application_default1.xamlx'.

The Guid of this particular workflow instance is 10013d62-286e-4a8f-aeb2-70582591cd7f so the workflow engine calculates an "instance key" from that which is 496e3207-fe9d-919f-b1df-f329c5a64934. I can see the workflow instance with the guid in [System.Activities.DurableInstancing].[InstancesTable] and I can see the instance key in [System.Activities.DurableInstancing].[KeysTable]. So far, so good and if the same server makes a later call to that same workflow, everything works fine. However, if a different server tries to access the workflow, I get the correlation error mentioned above. Once again looking at the diagnostics trace, I can see this:

<TraceData>
    <DataItem>
        <TraceRecord Severity="Information" Channel="Analytic" xmlns="http://schemas.microsoft.com/2004/10/E2ETraceEvent/TraceRecord">
            <TraceIdentifier>225</TraceIdentifier>
            <Description>Calculated correlation key '12e0b449-7a71-812d-977a-ab89864a272f' using values 'key1:10013d62-286e-4a8f-aeb2-70582591cd7f,' in parent scope '{/NewOrbit.ExVerifier.Web_IN_5_Web/Workflow/Application/}Application_default1.xamlx'.</Description>
            <AppDomain>/LM/W3SVC/1273337584/ROOT-1-129811251818669004</AppDomain>
        </TraceRecord>
    </DataItem>
</TraceData>

The important line is this:

Calculated correlation key '12e0b449-7a71-812d-977a-ab89864a272f' using values 'key1:10013d62-286e-4a8f-aeb2-70582591cd7f,' in parent scope '{/NewOrbit.ExVerifier.Web_IN_5_Web/Workflow/Application/}Application_default1.xamlx'.

As you can see, it is the same Guid being passed in but the system includes the name of the website instance in the calculation of the Instance key so it ends up with a completely different instance key.
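
To make the effect concrete, here is a small PowerShell sketch. It is not the algorithm Workflow Foundation actually uses; the MD5 hash and the Get-IllustrativeKey helper are assumptions chosen purely for illustration. It only shows why folding the scope name into a hash gives two different keys for the same correlation value:

```powershell
# Illustration only: NOT the algorithm Workflow Foundation really uses.
# Get-IllustrativeKey is a hypothetical helper that hashes the scope name
# together with the correlation values.
function Get-IllustrativeKey([string]$Scope, [string]$Values) {
    $md5 = [System.Security.Cryptography.MD5]::Create()
    $bytes = [System.Text.Encoding]::UTF8.GetBytes("$Scope|$Values")
    # An MD5 digest is 16 bytes, which is exactly the size of a Guid.
    New-Object System.Guid -ArgumentList (,$md5.ComputeHash($bytes))
}

# Values and scope names copied from the two traces above.
$values = 'key1:10013d62-286e-4a8f-aeb2-70582591cd7f,'
$scope2 = '{/NewOrbit.ExVerifier.Web_IN_2_Web/Workflow/Application/}Application_default1.xamlx'
$scope5 = '{/NewOrbit.ExVerifier.Web_IN_5_Web/Workflow/Application/}Application_default1.xamlx'

$keyOnInstance2 = Get-IllustrativeKey $scope2 $values
$keyOnInstance5 = Get-IllustrativeKey $scope5 $values

# Same correlation value, different scope name: compare the two keys.
"Instance 2: $keyOnInstance2"
"Instance 5: $keyOnInstance5"
"Keys match: $($keyOnInstance2 -eq $keyOnInstance5)"
```

If the scope name were identical on both servers, the two calls would receive identical input and so return the same key.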

I have created a completely new project to test this out and found the exact same problem. I feel I must be doing something very simple wrong as I can't find anyone else with the same problem.


Solution

  • A few months later, I have found a solution to this problem. The root problem is that Azure gives the web site a different name on each role instance: rather than "Default Web Site", the web site is called something like NewOrbit.ExVerifier.Web_IN_0_Web (given a namespace for your web project of NewOrbit.ExVerifier.Web). Workflow uses the website name as part of the algorithm that calculates the instance key, hence the problem.

    The solution is, quite simply, to rename the website during role startup so that it has the same name on all instances. This fixes the root problem rather than handling the consequences, and it is so obvious that I never saw it the first time round.

    Here is how you can do this (loosely based on this: http://blogs.msdn.com/b/tomholl/archive/2011/06/28/hosting-services-with-was-and-iis-on-windows-azure.aspx)

    Configure powershell to have elevated access rights so you can make changes after IIS has been configured:

    In ServiceDefinition.csdef add a startup task:

    <ServiceDefinition name="WasInAzure" xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition">
      <WebRole name="WebRole1">
          ...
          <Startup>
              <Task commandLine="setup\startup.cmd" executionContext="elevated" />
          </Startup>
      </WebRole>
    </ServiceDefinition>
    

    Setup\Startup.cmd should have this content:

    powershell -command "set-executionpolicy Unrestricted" >> out.txt
    

    Configure Role OnStart to have admin privileges

    In ServiceDefinition.csdef add this:

    <ServiceDefinition name="WasInAzure" xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition">
      <WebRole name="WebRole1">
      ...
        <Runtime executionContext="elevated" />
      </WebRole>
    </ServiceDefinition>
    

    Create a powershell script to rename the web site

    Create a setup\RoleStart.ps1 file:

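    Write-Host "Begin RoleStart.ps1"
    Import-Module WebAdministration

    # Optional first argument: part of the site name to match.
    # With no argument the pattern is "**", which matches every site.
    $siteName = "*" + $args[0] + "*"

    Get-Website $siteName | ForEach-Object {
        $site = $_
        $siteRef = "IIS:\Sites\" + $site.Name
        try {
            # -ErrorAction Stop turns a failed rename into an exception so the catch block runs.
            Rename-Item $siteRef 'MyWebSite' -ErrorAction Stop
            Write-Host "$($site.Name) was renamed"
        }
        catch {
            Write-Host "Failed to rename $($site.Name) : $($_.Exception.Message)"
        }
    }

    Write-Host "End RoleStart.ps1"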
    

    (replace MyWebSite with whatever you want the website to be called on all the servers).

    Run RoleStart.ps1 on role start:

    Create or Edit WebRole.cs in the root of your website project and add this code:

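    using System.Diagnostics;
    using System.IO;
    using Microsoft.WindowsAzure.ServiceRuntime;

    public class WebRole : RoleEntryPoint
    {
        public override bool OnStart()
        {
            var startInfo = new ProcessStartInfo
            {
                FileName = "powershell.exe",
                Arguments = @".\setup\rolestart.ps1",
                RedirectStandardOutput = true,
                UseShellExecute = false,
            };

            // Run the rename script and keep its output in out.txt for troubleshooting.
            using (var process = Process.Start(startInfo))
            {
                // Read the output before waiting, so a full output buffer cannot block the script.
                string output = process.StandardOutput.ReadToEnd();
                process.WaitForExit();
                File.WriteAllText("out.txt", output);
            }

            return base.OnStart();
        }
    }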
    

    And that should be it. If you spin up multiple web role instances and connect to them with RDP, you should now see that the website has the same name on every instance, so workflow correlation works across servers.
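
As an alternative to checking IIS Manager by eye over RDP, the site name can be listed from PowerShell on each instance. This is a minimal sketch, assuming an elevated PowerShell session and the same WebAdministration module that RoleStart.ps1 imports:

```powershell
# Run in an elevated PowerShell session on each role instance.
Import-Module WebAdministration

# List every IIS site with its state. After the rename, each instance
# should report the same name (for example 'MyWebSite').
Get-Website | Select-Object Name, State
```

If an instance still shows a name like NewOrbit.ExVerifier.Web_IN_0_Web, the rename did not run there, and out.txt is the first place to look.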