Is it possible to have too much monitoring of an Informix instance? I would say no, especially if your monitoring is lightweight and efficient. Being able to pinpoint an issue quickly in a disaster means you can reduce the time it takes to fix it, save your company’s money or reputation and be a hero at the same time.
I thought it would be worthwhile to do a general blog post on monitoring, partly to see if anyone will post suggestions for things I’ve missed via the comments section. I’m not going to focus too much on the exact mechanisms, although I will give a few hints here and there. Suffice to say you can do a lot with the various onstats, some simple queries on the system tables and some bash, perl, awk and grepping to bind it all together. There are also powerful commercial monitoring tools, most notably AGS Server Studio, although personally I find there isn’t a single tool that fulfils all my monitoring needs.
Let’s start with the simplest monitor of all: is your instance up? One effective way of monitoring this is via the return code from onstat -: if it’s 5, all is well. But there may be a problem with the database listener which this won’t detect, so you should perhaps also check that you can connect to your database via TCP/IP and run a simple (and quick) query.
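As a minimal sketch of the return-code check, something like the function below can be dropped into a monitoring script. Code 5 definitely means On-Line; the mapping of other codes (255 for off-line, for example) can vary between versions, so verify them against your release’s documentation before relying on them.

```shell
# Map the return code of `onstat -` to a human-readable state.
# 5 = On-Line is documented behaviour; the 255 (Off-Line) mapping
# is an assumption to verify on your version.
state_from_code() {
    case "$1" in
        5)   echo "On-Line" ;;
        255) echo "Off-Line" ;;
        *)   echo "Other ($1)" ;;
    esac
}

# Real usage (requires the Informix environment to be set):
#   onstat - >/dev/null 2>&1
#   state_from_code $?
```

From there you can alert whenever the state is anything other than "On-Line", and follow up with a TCP/IP connection test for the listener.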
Once you’ve established that your instance is online, you can check for other problems such as:
- An excessive number of sessions waiting on locks, a logical log buffer or buffers via onstat -u.
- The number of locks in use. I would avoid querying syslocks or running onstat -k because this can become inefficient when there’s a large number of locks. Using onstat -u instead is usually good enough and doesn’t run into scalability difficulties.
- Locks with waiters.
- Logical logs not backed up.
- Backup failures or storage snapshot issues. It’s good to monitor this independently rather than rely on the success or failure of your backup script.
- Replication status for HDR and RSS servers.
- SQL error conditions encountered by your applications.
- Long transactions and how many logical logs are needed to support a particular transaction.
- The number of memory segments and free memory reported by onstat -g seg.
- Memory used by CPU VP caches. I have a separate blog post on this.
- Error messages in the online log. I have a separate blog post on this topic too.
- Error messages in the console log.
- Number of ready threads, indicating a lack of available CPU VPs.
- Dbspaces running out of unallocated space.
- Any chunks that are down.
- Excessively long checkpoints.
- Out of date statistics or distributions.
- Query plan changes and slow-running queries (this one is tricky).
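To give a flavour of how cheap some of these checks can be, here is a sketch of counting sessions waiting on locks by piping onstat -u through awk. It assumes the second column is the flags field and that a leading "L" marks a lock wait, which matches the output format I’m used to; double-check the column layout on your version.

```shell
# Count sessions waiting on a lock from `onstat -u` output.
# Assumption: field 2 is the flags column and its first character
# is 'L' for a session waiting on a lock (verify on your version).
count_lock_waiters() {
    awk '$2 ~ /^L/ { n++ } END { print n + 0 }'
}

# Live usage (requires a running instance):
#   onstat -u | count_lock_waiters
```

The same pattern works for the other onstat -u wait states: alert when the count stays above a threshold for more than a couple of polling intervals, rather than on a single spike.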
It’s worth being aware of the various system limits you can run up against on larger systems:
- Tables and indices reaching the maximum number of extents, particularly in dbspaces with a 2 kb page size.
- Tables approaching the maximum number of data pages of 16,777,216. In older versions it was also necessary to check indices against the same limit, but the limit for detached indices is now much larger (certainly it is in 11.70.FC7W1).
- Auto-numbering fields running out of bits, particularly serials. (This may not be a problem for you if you don’t mind the next serial number wrapping around.)
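For the extents check, a periodic query against the sysmaster database is enough. The sketch below joins sysptnhdr to systabnames to report partitions with a high extent count; the nextns column name and the 200-extent threshold are assumptions to verify against your version’s sysmaster documentation, and the actual limit depends on page size.

```shell
# Report partitions with a high extent count (sketch; run with the
# Informix environment set). Column names and threshold are
# assumptions -- check the sysmaster reference for your version.
dbaccess sysmaster - <<'EOF'
SELECT n.dbsname, n.tabname, h.nextns
FROM sysptnhdr h, systabnames n
WHERE h.partnum = n.partnum
AND h.nextns > 200
ORDER BY h.nextns DESC;
EOF
```

A similar join works for the data-page limit by comparing the partition’s page counts against the maximum.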
It’s essential to work with your sys admins and network technicians who should have a good idea of what the operating system is doing. Often a high load average can be the first sign of a problem. They may be keeping a record of system performance that allows you to look back in time to when things were working better.
Lastly, a lot can be discovered about your system by periodically reviewing performance data for the number of locks in use, sequential scans and so on. This is a different type of monitoring from what I’ve mostly covered here, which is about detecting issues when they occur, but sometimes problems can only be spotted by analysing general system performance over time.
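Collecting that historical data can be as simple as a cron job appending timestamped snapshots of your chosen onstat output to a file. A minimal helper, with illustrative paths you would replace with your own:

```shell
# Prefix a command's output with a timestamp header, so repeated
# snapshots in one file can be separated later for trend analysis.
snapshot() {
    echo "=== $(date '+%Y-%m-%d %H:%M:%S') ==="
    "$@"
}

# Example cron usage (paths and interval are illustrative):
#   */10 * * * * . /opt/informix/env.sh && snapshot onstat -p >> /var/log/ifx/profile.log
```

Even crude data like this is invaluable when someone asks whether today’s behaviour is actually unusual.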
I think this is a pretty good list. Some of them are complete topics in themselves and worthy of separate blog posts.
There may be other things you can do that are specific to your setup. I have a few monitors for specific bugs and the status of their workarounds, some of which are no longer needed. You may also choose to monitor the health of your application. For example, if there are message queues of some sort, are they being flushed quickly enough? Is your application regularly inserting data into its journal or log file as you expect?
My goal is to be able to detect any scenario it’s possible to predict, especially where we’ve had problems in the past. This implies that my monitoring needs to evolve and be improved over time.
Good and comprehensive monitoring should make your life as a DBA easier. It allows you to point colleagues to your monitoring tools and gives them confidence in the results if they have doubts about whether the database is working properly. It also saves you having to dig around, running manual health checks or similar.
So then, anything I’ve missed?