I'm on a new team that has inherited a cross-platform application, along with installers for Windows, Mac, and Linux.
The application is a silent background agent responsible for applying patches and other configurations as specified by Your Friendly IT Team in our web back end, and if an installation fails, then IT loses their control over the machine and must reinstall our software manually. Failures are relatively unlikely, but they can happen for any number of reasons -- e.g. if the user chooses to reboot while we're in the middle of installing an upgrade (not to mention power loss and other such failures). We have enough devices under management in the field that this is definitely an issue we need to contend with regularly. Given that the whole point of our software is to eliminate the need for manual intervention, it's not looked upon fondly when our customers have to intervene manually in order to repair our app.
The Windows installer is an MSI built using WixSharp 1.9.3 (WiX 3) on .NET 4.5.1. I have minimal familiarity with development for Windows platforms and .NET in general, let alone any of the specifics of the Installer API, WiX, or WixSharp, so I may be missing something fairly obvious, but I've been digging around in the docs for all of the above for over a week now, and while I have learned a ton, so far I've been unable to figure this out.
From what I can tell from my installer log, the RemoveFiles action runs early in the installation process, and it removes any previous installation of my application before installing the new one. Because of this, there is a period of time in which no version of our application is installed. That's bad for recoverability!
Let's call the existing binary agent.exe. To maximize recoverability, being unaware of this RemoveFiles step, we modified the installer so that it:
1. Installs the new binary as agent-{newVersion}.exe
2. Runs self-tests against agent-{newVersion}.exe. Aborts if those tests fail.
3. Backs up agent.exe as agent-{oldVersion}.exe
4. Copies agent-{newVersion}.exe to agent.exe (overwriting the old one)
Of course, some installs are fresh installs rather than upgrades, so all of the PostInstall steps have to tolerate the old files being missing -- which led to the appearance that everything was working correctly, even though the old files are always missing.
When testing, we introduced a delay in step 1 and then forced the test machine to reboot during the delay. When the machine comes back up, the installation is "suspended" and there is no agent.exe, and so our Service fails to start, i.e. there's no agent running. This is easily resolved by re-running the installer -- but that requires manual intervention.
So, how do I configure the MSI installer such that it does not remove the old binary before it writes the new one? Can I suppress RemoveFiles entirely and replace it with a custom action? Can I configure RemoveFiles itself to behave differently? (or am I going at this whole thing all wrong?)
Note that I still need RemoveFiles (or a custom replacement) to run during uninstall, and in that scenario I need it to delete agent.exe as well as any agent-{version}.exe copies.
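One possible way to cover the versioned copies at uninstall time, sketched in WiX 3 authoring rather than taken from the installer described here: a RemoveFile entry with a wildcard name adds a row to the RemoveFile table, which the standard RemoveFiles action processes. The Ids and the INSTALLDIR directory reference are hypothetical, and the sketch assumes the agent-{version}.exe copies sit in the same directory as agent.exe.

```xml
<Wix xmlns="http://schemas.microsoft.com/wix/2006/wi">
  <Fragment>
    <!-- INSTALLDIR is a hypothetical directory Id for the install folder. -->
    <DirectoryRef Id="INSTALLDIR">
      <Component Id="AgentBinary" Guid="*">
        <File Id="AgentExe" Source="agent.exe" KeyPath="yes" />
        <!-- Wildcard match for files the MSI did not install itself.
             On="uninstall" applies when this component is being removed. -->
        <RemoveFile Id="RemoveVersionedAgentCopies"
                    Name="agent-*.exe"
                    On="uninstall" />
      </Component>
    </DirectoryRef>
  </Fragment>
</Wix>
```

Because On="uninstall" is tied to removal of the component, it may also fire when an old product is removed as part of a major upgrade, depending on how the upgrade is scheduled.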
Thanks to Rob Mensching's answer, I found that I can do:
project.MajorUpgrade = new MajorUpgrade {
    Schedule = UpgradeSchedule.afterInstallExecute,
};
This moves the invocation of RemoveExistingProducts (which I infer is a parent of RemoveFiles) later in the process, after I've performed my self-test and copied agent-{newVersion}.exe onto agent.exe. The installation can now survive an unfortunately timed restart, after which the old version remains in place and functioning. Hooray!
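For reference, the WixSharp setting above corresponds roughly to the following WiX 3 authoring, assuming WixSharp emits a MajorUpgrade element for it. The DowngradeErrorMessage text is a placeholder; WiX 3 requires that attribute unless downgrades are explicitly allowed.

```xml
<!-- Child of <Product>. The message text is a placeholder. -->
<MajorUpgrade
    Schedule="afterInstallExecute"
    DowngradeErrorMessage="A newer version is already installed." />
<!-- Other Schedule values accepted by WiX 3:
     afterInstallValidate (the default), afterInstallInitialize,
     afterInstallExecuteAgain, afterInstallFinalize.
     Each one positions RemoveExistingProducts relative to that action. -->
```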
But, not so fast: now, somewhere later in the process, RemoveExistingProducts still runs, and still deletes agent.exe. So now, after the installer exits successfully, the binary is missing. Bug technically fixed, outcome objectively worse.
But now I'm aware of RemoveExistingProducts, and so I tried this:
<InstallExecuteSequence>
    <RemoveExistingProducts Suppress="yes"/>
</InstallExecuteSequence>
This yields an error message:
The InstallExecuteSequence table contains an action 'RemoveExistingProducts' that cannot be suppressed because it is not declared overridable in the base definition. Please stop suppressing the action or make it overridable in its base declaration.
This led me to this question, the answers for which are "Why would you want to suppress that? It's the whole point of MajorUpgrade."
My very limited understanding is that my alternative would be to introduce a new ProductId and do an Install rather than a MajorUpgrade. Unfortunately, due to the nature of our product, I cannot. A large portion of our customers have explicit allow-lists of product IDs that can be installed on their devices; we cannot have a rolling product ID that changes with every version my team releases.
The adoption of WiX and WixSharp predates my involvement with the project by a few years, and sadly the folks who made those decisions are no longer with the company. Likewise the use of MajorUpgrade. The only thing that I am sure needs to remain is the use of Windows Installer with an MSI file.
How should I proceed? Do Windows Installer and/or WiX have an alternative to MajorUpgrade which can do what I need?
I decided that my custom postinstall process was part of the problem. In particular, the fact that the new installer didn't know that agent.exe was still part of the new installation, which I presume was why RemoveExistingProducts was deleting it. That functionality was in place in order to maintain consistent installer behavior across Windows, Mac, and Linux. But it was now apparent that this was making me fight against the way MSI is designed to work. So, I rolled back all of that stuff. Just install the files in place and run the tests!
That helped. But I still had custom actions installing and removing the Service, and now I found myself in a place where the Service was being removed at the end of an otherwise successful installation. Having read at length about all the pitfalls of custom actions, I decided to eliminate the custom actions for installing and removing the Service and replace those with the equivalent WiX-native actions. This went well, except WixSharp generated invalid XML (duplicate keys on some of the ServiceControl tags).
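For context, native service handling in WiX 3 is usually authored along these lines. This is a generic sketch, not the XML that WixSharp generated here: the service name, display name, Ids, and the INSTALLDIR reference are all hypothetical.

```xml
<Wix xmlns="http://schemas.microsoft.com/wix/2006/wi">
  <Fragment>
    <!-- INSTALLDIR is a hypothetical directory Id for the install folder. -->
    <DirectoryRef Id="INSTALLDIR">
      <Component Id="AgentServiceComponent" Guid="*">
        <!-- The service executable must be the component's key path. -->
        <File Id="AgentServiceExe" Source="agent.exe" KeyPath="yes" />
        <!-- Registers the service when the component is installed. -->
        <ServiceInstall Id="AgentServiceInstall"
                        Name="AgentService"
                        DisplayName="Agent Service"
                        Type="ownProcess"
                        Start="auto"
                        ErrorControl="normal" />
        <!-- Start on install, stop on install and uninstall,
             delete the service on uninstall. -->
        <ServiceControl Id="AgentServiceControl"
                        Name="AgentService"
                        Start="install"
                        Stop="both"
                        Remove="uninstall"
                        Wait="yes" />
      </Component>
    </DirectoryRef>
  </Fragment>
</Wix>
```

Each ServiceControl Id becomes the primary key of a row in the ServiceControl table, so two elements sharing an Id would be one way to end up with a duplicate-key error.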
The setup I inherited was using an old version of WixSharp (1.9.3). My lack of familiarity with the whole ecosystem had me hesitating to make significant changes, but I decided to upgrade it anyway. Upgrading to the latest 1.x version (1.24.2, released last week as of this writing) resolved the invalid XML issue, but now we were back to uninstalling the service at the end of an otherwise-successful installation. No change here!
Digging into the log files more than I had previously, and correlating what I was seeing there with the InstallExecuteSequence table in Orca, I learned that this was happening as part of my old friend, RemoveExistingProducts.
In Orca, I noticed that the conditional actions declare their conditions right there in the InstallExecuteSequence table. I copied (NOT UPGRADINGPRODUCTCODE) AND (REMOVE="ALL") from another row and pasted it into the RemoveExistingProducts row. I then tried an upgrade using that modified version of the installer, and everything worked!
I then went back and reproduced that config change with WixSharp code:
...
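    // Hook that runs after WixSharp has generated the WiX source document.
    project.WixSourceGenerated += MakeRemoveExistingProductsConditional;
    ValidateAssemblyCompatibility();
    WixSharp.MSBuild.EmitAutoGenFiles = true;
    Compiler.BuildMsi(project);
}

// Adds a RemoveExistingProducts entry to InstallExecuteSequence whose
// condition is the one copied in Orca: run only on a true uninstall.
static void MakeRemoveExistingProductsConditional(XDocument document) {
    var rep = document.FindFirst("InstallExecuteSequence").AddElement("RemoveExistingProducts");
    // The condition goes in as the element's inner text.
    rep.Add(new XText(Condition.BeingUninstalledAndNotBeingUpgraded.Value));
}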
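Assuming the hook finds an existing InstallExecuteSequence element, and that Condition.BeingUninstalledAndNotBeingUpgraded carries the same condition that was copied in Orca, the generated WiX source should end up containing something like the following. This is a sketch of the expected shape, not a captured file.

```xml
<InstallExecuteSequence>
  <!-- any entries already present are omitted here -->
  <RemoveExistingProducts>(NOT UPGRADINGPRODUCTCODE) AND (REMOVE="ALL")</RemoveExistingProducts>
</InstallExecuteSequence>
```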
So now we're mostly good. One problem: Now, when I interrupt the installation as before, the machine comes back up with a functioning, upgraded installation -- agent.exe is the new version, and the Service is present and running. Sounds good, right? Except the stuff that the installer was supposed to do after this involved running some tests against the new binary. Those tests haven't been run, which means we don't know if we've left the agent in a healthy state. The desired behavior here is that it should either resume where it left off or roll back so we can try again later. The installation is "suspended," and the registry still has a record of the rollback script. As far as I can tell, Windows will never automatically resume the installation attempt. I can resolve it manually by running the MSI again, but our users have no way of knowing that, nor is it acceptable from a product standpoint for this responsibility to fall on them.
So I guess now I have a new question -- what's the "right" way to resume a suspended MSI installation?