I'm running a complex server setup for a defacto high-availability service. So far it takes me about two days to set everything up so I would like to automate the provisioning.
However I do a quite a lot of manual changes to (running) server(s). A typical example is changing a firewall configuration to cope with various hacking attempts, packet floods etc. Being able to work on active nodes quickly is important. Also the server maintains a lot of active TCP connections and loosing those for a simple config change is out of question.
I don't understand if either Chef or Puppet is designed to deal with this. Once I change some system config, I would like to store it somewhere and use it while the next instance is being provisioned. Should I stick with one of those tools or choose a different one?
Hand made changes and provisioning don't take hands. They don't even drink tea together.
At work we use puppet to manage all arquitecture, and as you we need to do hand made changes in a hurry due to performance bottlenecks, attacks, etc.
What we do is first make sure puppet is able to setup every single part of the arquitecture ready to be delivered without any specific tuning.
Then when we need to do hand made changes, if in a hurry as long you don't mess with files managed by puppet there's no risk, if it's a puppet managed file what we need to change then we just stop puppet agent and do whatever we need.
After hurry ended, we proceed as follows:
These changes should be applied to all servers with same symptoms ?
If so, then you can develop what puppet call 'facts' which is code that it's run on the agent on each run and save results in variables available in all your puppet modules, so if for example you changed ip conntrack max value because a firewall was not able to deal with all connections, you could easily (ten lines of code) have in puppet on each run a variable with current conntrack count value, and so tell puppet to set a max value related to current usage. Then all other servers will benefit for this tunning and likely you won't ever have to deal with conntrack issues anymore (as long you keep running puppet with a short frequency which is the default)
These changes should be always applied by hand on given emergencies?
If configuration is managed by puppet, find a way to make configuration include other file and tell puppet to ignore it. This is the easiest way, however it's not always possible (e.g. /etc/network/interfaces does not support includes). If it's not possible, then you will have to stop puppet agent during emergencies to be able to change puppet files without risk of being removed on next puppet run.
Are this changes only for this host and no other host will ever need it?
Add it to puppet anyway! Place a sweet if $fqdn == my.very.specific.host and put inside whatever you need. Even for a single case it's always beneficial (and time consuming) to migrate all changes you do to a server, as will allow you to do a full restore of server setup if for some reason your server crash to a not recoverable state (e.g. hardware issues)
In summary:
For me the trick in dealing with hand made changes it's putting a lot of effort in reasoning how you decided to do the change and after emergency is over move that logic into puppet. If you felt something was wrong because for a given software slots all were used but free memory was still available on the server so to deal with the traffic peak was reasonable to allow more slots to be run, then spend some time moving that logic into puppet. Very carefully of course, and as time consuming as the amount of different scenarios on your architecture you want to test it against, but at the end it's very, VERY rewarding.