A few weeks ago we had a small failure: one of our schedulers failed to start on the staging environment, so some of our background tasks simply never ran. A few months ago we had another issue: one of the web-facing boxes had an invalid serialization configuration, so public API calls to the load-balanced farm failed at random. Why am I talking about this? Well, a week ago we finally decided that we need some kind of health monitoring for our environment.
So far we already have some integration tests (NUnit) that check our public endpoints and environment setup. The only problem is that they run only after deployment builds.
The idea is simple: schedule the integration tests to run every XX minutes.
I think most build servers can do this without any problems. I do not know why we didn't do this ages ago.
Now I am about to write some tests that use heuristics based on the usage data of our services, so we will probably be able to identify early when something goes wrong. Later we will expand this to check database health, disk usage and so on.
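To make the heuristic idea a bit more concrete, here is a rough sketch of what such a check could look like. It is written in Python only for brevity (our real tests are NUnit), and everything in it is an assumption for illustration: the function name `is_usage_healthy`, the `min_ratio` threshold and the sample numbers are all made up. The check compares the latest usage figure against the median of recent history and fails when it drops too far below it.

```python
from statistics import median


def is_usage_healthy(history, current, min_ratio=0.5):
    """Return True when `current` is at least `min_ratio` of the historical median.

    `history` is a list of past usage counts (for example, API calls per
    time window) and `current` is the count for the latest window.
    Both the name and the 0.5 default are hypothetical.
    """
    if not history:
        # No baseline yet, so there is nothing to compare against.
        return True
    baseline = median(history)
    return current >= baseline * min_ratio


def test_api_usage_has_not_dropped():
    # Hypothetical numbers: API calls per time window on one box.
    history = [120, 135, 110, 140, 125]
    # A normal window passes the check.
    assert is_usage_healthy(history, current=118)
    # A sharp drop (e.g. a box with a broken configuration) is flagged.
    assert not is_usage_healthy(history, current=20)
```

The median is used instead of the mean so that a single unusual window in the history does not shift the baseline too much. In a real test the history would of course come from the service's usage data rather than a hard-coded list.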
If this works for us, we will open source our library of environment assertions.