My team had the chance to deal with a few fires today, and how we went about it got me thinking (and writing, apparently).
It matters quite a bit how you find out about problems, it seems. Perhaps this is obvious in retrospect, but today definitely highlighted it.
For instance, we had a bug in one of our deployed applications that wasn’t noticed in any of our test environments and affected only our production servers. No-one outside the team noticed it; we simply saw an error in our email, and a line on a graph showing that something hadn’t gone quite right, so we dove in and worked on it. This particular one was a tough nut to crack, but we kept at it and got it fixed in a couple of hours; often we fix things much quicker.
At the same time, our monitors showed a problem with a lack of disk space on another (test) server, so we got some help from another team and had that one sorted out pretty quickly, although it certainly complicated our lives a bit.
Then we found that a count we were checking against another source for another urgent project seemed wrong, so we spawned a task to investigate…
At the end of the day, the cybernetic gods smiled on us and we got everything back in good order before five o’clock, so we all left feeling pretty positive.
Thinking back on the day, I realized that none of the problems were discovered outside our own team. Although we made a point of mentioning to other teams that we had a problem, as it might affect work they were doing, no-one outside the team detected the problems themselves.
Looking back, I’m pretty proud of that. It indicates two things, I believe: first, that we’ve got a pretty good monitoring system for spotting when something starts to go wrong. Secondly, when something does go wrong, the harm is contained to a very small portion of our system.
For the first, we make extensive use of a number of monitoring tools. In this case it was Icinga that first notified us of the problem; we then immediately turned to our graphs, produced in near-real-time by Graphite, to see the source of the problem. Once we identified the failing flow, we used Graylog to get the detailed traces we needed to fully locate the issue.
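For those graphs to exist in the first place, the applications have to emit metrics. As a rough illustration of how lightweight that can be, here’s a minimal sketch of pushing a metric to Graphite over Carbon’s plaintext protocol (the host and metric names are invented for the example; the protocol itself is just `path value timestamp` on port 2003):

```python
import socket
import time

def graphite_line(path, value, timestamp=None):
    """Format one metric in Graphite's plaintext protocol: 'path value timestamp\\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"

def send_metric(host, port, path, value):
    """Push a single metric to a Carbon plaintext listener (conventionally port 2003)."""
    with socket.create_connection((host, port), timeout=2) as sock:
        sock.sendall(graphite_line(path, value).encode("ascii"))

# For example, bumping an error counter from an exception handler:
# send_metric("graphite.internal", 2003, "myapp.orders.write_errors", 1)
```

One counter like that, incremented in the error path, is all it takes to produce the tell-tale “line on a graph” that tipped us off.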
So, that was the fire alarm. Why didn’t the fire spread?
Our entire system is organized around an event-driven architecture with many small deployable units. We separate our write processes from our read models, so even if a write process goes down (which in this case it did), the read model is still fully available, and was in fact in constant use. The read model, if you do it right, is usually considerably simpler than the write process, so it’s easier to keep it running.
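To make the separation concrete, here’s a toy sketch of the idea in Python (the class and event names are invented for illustration, and a real system would use a durable log rather than an in-memory list): the write side only validates commands and appends events, the read side only projects events into a queryable view, and once the view is built it keeps answering queries even if the writer disappears.

```python
from collections import defaultdict

class EventLog:
    """Append-only log shared between the write side and the read side."""
    def __init__(self):
        self.events = []
        self.subscribers = []

    def append(self, event):
        self.events.append(event)
        for handler in self.subscribers:
            handler(event)

class OrderWriter:
    """Write side: validates commands and emits events; never serves queries."""
    def __init__(self, log):
        self.log = log

    def place_order(self, order_id, amount):
        if amount <= 0:
            raise ValueError("amount must be positive")
        self.log.append({"type": "OrderPlaced", "id": order_id, "amount": amount})

class OrderReadModel:
    """Read side: a simple projection, available even when the writer is down."""
    def __init__(self, log):
        self.totals = defaultdict(int)
        log.subscribers.append(self.apply)

    def apply(self, event):
        if event["type"] == "OrderPlaced":
            self.totals[event["id"]] += event["amount"]

    def total_for(self, order_id):
        return self.totals[order_id]

log = EventLog()
writer = OrderWriter(log)
view = OrderReadModel(log)
writer.place_order("o1", 40)
writer.place_order("o1", 2)
writer = None  # even with the write process gone, the read model still answers
print(view.total_for("o1"))  # → 42
```

The projection is just a dictionary update per event, which is the point: the read side has almost nothing in it that can break.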
Where do your fires start, or at least, from what source do you come to notice them?
If it’s your own monitoring, that’s pretty good. If something emails you and says “fix this, right here”, you’re winning compared to a customer calling to say “something strange just happened… somewhere”, or another team noticing the issue before you do.