Get familiar with normal
Systems are complex, with many parts interacting in many different ways. There's the hardware: memory, storage, CPU; the network: cards, cables, switches, routers; and devices such as HSMs and load balancers. And then there's the software - layers of virtualisation, operating systems, middleware, databases, and the application itself.
A small issue somewhere can have big consequences, especially under load. A database routine kicks off that doubles the DB response time from 30ms to 60ms. The effective capacity of the system halves because each request now takes twice as long to fulfil. Connection pools fill up and requests start to time out. Memory limits are reached and software starts paging to disk. The application stops working and the infrastructure is bombarded with retries from users.
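To see why a doubled response time halves capacity, a little Little's Law arithmetic helps. This is a minimal sketch, not from the original scenario: the pool size and response times are illustrative assumptions.

```python
# Illustrative only: with a fixed pool of connections, the maximum throughput
# a tier can sustain is pool_size / response_time (Little's Law).
POOL_SIZE = 100  # assumed number of database connections

def max_throughput(pool_size: int, response_time_s: float) -> float:
    """Upper bound on requests per second a fixed-size pool can serve."""
    return pool_size / response_time_s

normal = max_throughput(POOL_SIZE, 0.030)  # 30ms response time
slow = max_throughput(POOL_SIZE, 0.060)    # 60ms response time

print(f"At 30ms: {normal:.0f} req/s")  # ~3333 req/s
print(f"At 60ms: {slow:.0f} req/s")    # ~1667 req/s - half the capacity
```

Once demand exceeds that reduced ceiling, queues build, pools saturate, and the cascade above begins.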
Of course, no-one's told the IT team that the database has slowed down a bit, so they've no idea what's going on. They've checked all the database monitoring. A 60ms response time looks fine. The issue must be further up the chain, surely?
But we know better. We know the response time is usually 30ms. Back when there were no issues, we spent time perusing the monitoring graphs to get familiar with normal. We know how many transactions we normally get at busy times of the day. We know how much network bandwidth is normally utilised. We know what's normally written to the logs and how frequently garbage collection normally runs.
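That familiarity can be made concrete. Here is a minimal sketch, assuming a hand-rolled check rather than any particular monitoring product: keep a baseline of readings gathered while the system was healthy and flag anything that drifts well outside it. The numbers and the three-sigma threshold are assumptions for illustration.

```python
from statistics import mean, stdev

def is_anomalous(baseline_ms: list[float], current_ms: float,
                 sigma: float = 3.0) -> bool:
    """Return True if the current reading sits outside the normal band."""
    mu = mean(baseline_ms)
    sd = stdev(baseline_ms)
    return abs(current_ms - mu) > sigma * sd

# Response times collected during a quiet, healthy week (illustrative values).
quiet_week = [29.5, 30.2, 31.0, 29.8, 30.5, 30.1, 29.9]

print(is_anomalous(quiet_week, 30.4))  # False - within normal
print(is_anomalous(quiet_week, 60.0))  # True  - the database has slowed down
```

A reading of 60ms may look perfectly healthy in isolation, but against a baseline of roughly 30ms it stands out immediately.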
We point the anomaly out to the DBA, who discovers an archiving job that has been running at the wrong time of day since a database upgrade several months ago (it hasn't coincided with a busy end-of-month day until now). The job is killed and the infrastructure soon returns to normal.
