Ideas for monitoring an Informix instance

Is it possible to have too much monitoring of an Informix instance? I would say no, especially if your monitoring is lightweight and efficient. Being able to pinpoint an issue quickly in a disaster means you can reduce the time it takes to fix it, save your company’s money or reputation and be a hero at the same time.

I thought it would be worthwhile to do a general blog post on monitoring, partly to see if anyone will post suggestions for things I’ve missed via the comments section. I’m not going to focus too much on the exact mechanisms, although I will give a few hints here and there. Suffice to say you can do a lot with the various onstats, some simple queries on the system tables and some bash, perl, awk and grepping to bind it all together. There are also powerful commercial monitoring tools, most notably AGS Server Studio, although personally I find there isn’t a single tool that fulfils all my monitoring needs.

Let’s start with the simplest monitor of all: is your instance up? One effective way of monitoring this is via the return code from onstat -. If it’s 5, all is well. But this won’t detect a problem with the database listener, so you should also check that you can connect to your database via TCP/IP and run a simple (and quick) query.
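
As a very rough sketch (assuming the usual Informix environment variables are set and that the connection dbaccess makes uses a network alias, otherwise the listener isn’t really being tested):

# Is the instance on-line? onstat - exits with the shared memory mode; 5 means on-line.
onstat - > /dev/null 2>&1
if [ $? -ne 5 ]; then
    echo "Instance is not on-line"
    exit 1
fi

# Check the listener too by running a trivial query through dbaccess.
echo 'select count(*) from systables;' | dbaccess sysmaster - > /dev/null 2>&1 \
    || { echo "Cannot connect and run a query"; exit 1; }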

Once you’ve established that your instance is online, you can check for other problems such as:

  • An excessive number of sessions waiting on locks, a logical log buffer or buffers via onstat -u (there’s a rough sketch of this after the list).
  • The number of locks in use. I would avoid querying syslocks or running onstat -k because this can become inefficient when there’s a large number of locks. Using onstat -u instead is usually good enough and doesn’t run into scalability difficulties.
  • Locks with waiters.
  • Logical logs not backed up.
  • Backup failures or storage snapshot issues. It’s good to monitor this independently rather than rely on the success or failure of your backup script.
  • Replication status for HDR and RSS servers.
  • SQL error conditions encountered by your applications.
  • Long transactions and how many logical logs are needed to support a particular transaction.
  • The number of memory segments and free memory reported by onstat -g seg.
  • Memory used by CPU VP caches. I have a separate blog post on this.
  • Error messages in the online log. I have a separate blog post on this topic too.
  • Error messages in the console log.
  • Number of ready threads, indicating a lack of available CPU VPs.
  • Dbspaces running out of unallocated space.
  • Any chunks that are down.
  • Excessively long checkpoints.
  • Out of date statistics or distributions.
  • Query plan changes and slow-running queries (this one is tricky).
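
For a few of the items above, here’s a rough sketch of the sort of thing I mean. The onstat -u field positions and the sysmaster column names are as I see them on 11.70; check them against your own version before relying on this:

# Sessions waiting on a lock: the first character of the flags field
# (second column of onstat -u) is the wait flag, 'L' meaning a lock wait.
onstat -u | awk '$2 ~ /^L/ { waiters++ } END { print waiters+0, "lock waiters" }'

# Approximate locks in use: sum the locks column of onstat -u (assumed to be
# field 8 here) instead of running onstat -k or querying syslocks.
onstat -u | awk '$8 ~ /^[0-9]+$/ { locks += $8 } END { print locks+0, "locks in use" }'

# Free pages per dbspace and any chunks that are down, via sysmaster.
echo 'select name dbspace, sum(chksize) size_pages, sum(nfree) free_pages
      from sysdbspaces d, syschunks c
      where d.dbsnum = c.dbsnum
      group by 1 order by 1;

      select chknum, dbsnum from syschunks where is_offline = 1;' \
    | dbaccess sysmaster -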

It’s worth being aware of the various system limits you can run up against on larger systems:

  • Tables and indices reaching the maximum number of extents, particularly in dbspaces with a 2 KB page size (there’s a query sketch after this list).
  • Tables approaching the maximum number of data pages of 16,777,216. In older versions it was also necessary to check indices against the same limit, but the limit for detached indices is now much larger (certainly it is in 11.70.FC7W1).
  • Auto-numbering fields running out of bits, particularly serials. (This may not be a problem for you if you don’t mind the next serial number wrapping around.)
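
Here’s a hedged sketch of the kind of checks I mean, using the sysmaster tables systabnames and sysptnhdr (the column names are as I find them on the versions I work with, and the thresholds are arbitrary examples):

# Partitions with a high extent count, and partitions approaching the
# 16,777,216 data-page limit.
echo 'select n.dbsname, n.tabname, h.nextns
      from systabnames n, sysptnhdr h
      where n.partnum = h.partnum and h.nextns > 200
      order by h.nextns desc;

      select n.dbsname, n.tabname, h.npdata
      from systabnames n, sysptnhdr h
      where n.partnum = h.partnum and h.npdata > 15000000
      order by h.npdata desc;' \
    | dbaccess sysmaster -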

It’s essential to work with your sys admins and network technicians who should have a good idea of what the operating system is doing. Often a high load average can be the first sign of a problem. They may be keeping a record of system performance that allows you to look back in time to when things were working better.

Lastly, a lot can be discovered about your system by periodically reviewing performance data for the number of locks in use, sequential scans and so on. This is a different type of monitoring from what I’ve mostly covered here, which is about detecting issues when they occur, but sometimes problems can only be spotted by analysing general system performance over time.

I think this is a pretty good list. Some of them are complete topics in themselves and worthy of separate blog posts.

There may be other things that you can do specific to your setup. I have a few monitors for specific bugs and the status of their workarounds, some of which are no longer needed. You may also choose to monitor the health of your application. For example, if there are message queues of some sort, are they being flushed quickly enough? Is your application regularly inserting data into its journal or log file as you expect?

My goal is to be able to detect any scenario it’s possible to predict, especially where we’ve had problems in the past. This implies that my monitoring needs to evolve and be improved over time.

Good and comprehensive monitoring should make your life as a DBA easier. It allows you to point others to your monitoring tools, which gives them confidence when they have doubts about whether the database is working properly. It saves you having to dig around, running manual health checks or similar.

So then, anything I’ve missed?


Monitoring virtual segment usage and the CPU VP caches

A little while ago I was forced to look in detail at the memory usage of one of the production instances I look after. Specifically, the instance was allocating extra virtual segments to itself (via the SHMADD process), which was a surprise because memory usage on this instance was being monitored and the monitoring clearly showed that usage was normal and well below the initial segment size.

Well, almost everything showed that. The exception was onstat -g seg, which was correctly reporting the memory usage. Without wishing to put in an early spoiler, this is the only way of seeing how much memory your system is really using and how close you are to your instance allocating an extra segment.
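
If you want something to graph or alert on, the output is easy enough to parse. A small sketch, assuming the column layout I see on 11.70 (id, key, addr, size, ovhd, class, blkused, blkfree, with class V for the virtual segments):

# Total used and free blocks across the virtual (class V) segments.
onstat -g seg | awk '$6 == "V" { used += $7; free += $8 } END { print "virtual blkused=" used+0, "blkfree=" free+0 }'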

With the help of a careful analysis by IBM Informix support, we looked into memory usage on our system using:

onstat -u
onstat -g mem
onstat -g mgm
onstat -g ses

We also drilled down into individual sessions using:

onstat -g afr
onstat -g ffr
onstat -g ufr

We also looked into memory usage by onbar and SQLTRACE.

The result was a massive discrepancy between the total memory in use, as reported by these tools, and what onstat -g seg was reporting. And onstat -g seg appeared to be right because when it said the memory was all used up, the engine would allocate another segment.

So where was the memory going? Was it a leak or a bug? Well no, it was a new feature or, as we later learned, a bug fix.

In response to a performance issue reported by another customer, IBM had redesigned the VP memory cache code in 11.70.FC7W1, and it turned out that this was responsible for the use of up to 70% of the virtual memory segment on our system. This is an Enterprise Edition only feature, so I guess if you run any other edition you can stop reading at this point and just note that monitoring the output from onstat -g seg is a great idea if you want to predict segment additions.

The CPU memory cache allocates a dedicated area of the virtual memory segment to each CPU virtual processor, which it can use exclusively without contending with other virtual processors.

So a new and undocumented feature in a W1 release? Isn’t this a bit irregular? Well, no, said IBM: it was a fix for a customer issue. But it does change the documented behaviour quite dramatically.

In 11.70.FC7 and before, the CPU memory cache size is controlled by the VP_MEMORY_CACHE_KB parameter: if you have, say, 8 CPU VPs, this results in a fixed area of memory of 8 × VP_MEMORY_CACHE_KB being allocated to the CPU memory caches, and this is still how the manual says it works.

In 11.70.FC7W1 this parameter merely controls the initial size of these caches, which are then free to grow (and, I think, shrink below the initial size) as they see fit. To improve performance, memory can be allocated to these caches without having to free any first, and a separate thread deals with garbage collection (or drainage). (I hope I have explained this properly as I am not a programmer.) What is certain is that if your system is very busy, the caches grow faster than the garbage collection clears them down. If your system is very busy for a sustained period, they can grow and allocate memory until you hit SHMTOTAL, if you’ve set it. (I don’t think hitting this limit would be very pretty because the instance would kick out sessions to free up memory, but this is not where the problem lies. Anyway, it would need testing and I haven’t done so.)

So can you monitor it? Yes, you can, and if you’re running 11.70.FC7W1 or above with the VP cache switched on, I’d recommend that you do. This little code snippet does the job of calculating the total size in MB of all the VP caches for an instance:

# Sum the second field of each per-VP line from onstat -g vpcache (the cache
# size in 4 KB blocks) and convert to MB (divide by 256).
vpcacheblks=0
vpcache=(`onstat -g vpcache | grep -E '^    [0-9]' | awk '{ print $2 }'`)
for ((i=0; i<${#vpcache[*]}; i++)); do
    vpcacheblks=`expr $vpcacheblks + ${vpcache[$i]}`
done
vpcachemb=`echo 'scale=1; '${vpcacheblks}' / 256' | bc`
echo $vpcachemb

You can also use the output from onstat -g vpcache to work out the number of missed drains using the formula (free - alloc) - drains.

If you have a busy system and particularly one with heavy peak periods, graphing the size of this over time is very interesting. Equally if your system is not that busy, you may see flat memory usage. It’s worth knowing which applies to you.

So if, having done a bit of investigation on your own system, you’ve found that this does affect you, what can you do to mitigate it? Here are some options:

  • Downgrade to 11.70.FC7 or earlier.
  • Set VP_MEMORY_CACHE_KB to 0. You can actually do this dynamically using onmode -wm to clear out the cache and then reset it to its original value immediately afterwards (there’s a sketch of this after the list).
  • Increase SHMVIRTSIZE to accommodate the growth. Of course you need spare memory in your server to do this.
  • Set SHMTOTAL to remove the possibility of your server swapping. If you do, also look at setting up the low memory manager.
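
As a sketch of the second option (the 800 here is just an example; substitute the value from your own onconfig):

# Flush the VP memory caches by zeroing the parameter, then put it back.
# onmode -wm changes the value in memory only; use -wf if you also want onconfig updated.
onmode -wm VP_MEMORY_CACHE_KB=0
onmode -wm VP_MEMORY_CACHE_KB=800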

So what are IBM doing about the situation? An APAR has been raised as follows:


This should result (in 11.70.FC8) in the documentation being updated and in a choice of STATIC or DYNAMIC modes for the VP caches. DYNAMIC will be the same as the new behaviour, and STATIC will be more similar to how things were previously, where the VP caches were a fixed size. Note I said more similar and not the same. It will be interesting to look at how this behaves when it’s available.

There’s also another issue touched on here, which I’ve logged via the new request for enhancement (RFE) site: onstat -g mem does not include the VP cache sizes in its output and is therefore not a complete view of all the memory pools in your instance. The RFE requests that it should be.

Parsing the online log

At a recent IBM roadshow there was a brief discussion where someone mentioned that they monitor their instances using a script and regular expressions to parse the online log. Using such an approach is quite straightforward, although it is somewhat tedious to code for all the possible combinations.

You can actually find most, if not all, of the possible messages by running:

strings $INFORMIXDIR/msg/en_us/0333/olmsglog.iem

However, there is an easier way. Nearly always when a message is written to the online log, the alarmprogram is called. If you’re familiar with the alarmprogram, you’ll know that when it’s called a severity value is passed through, which you can use as the basis for whether or not you’re alerted by email. This severity value is never shown in the online log, so already you can see that other approaches might have more potential.
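
As a reminder of what that looks like, a stripped-down alarmprogram handler might be something like this (the severity threshold, mailx call and address are placeholders; the argument order follows the standard alarmprogram interface):

#!/bin/sh
# Arguments: severity, class id, class message, specific message, see-also file.
severity=$1
class_msg="$3"
specific_msg="$4"

# Only email on severity 3 (attention) and above.
if [ "$severity" -ge 3 ]; then
    echo "$class_msg: $specific_msg" | mailx -s "Informix alarm, severity $severity" dba@example.com
fi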

There is now a scheduler task called post_alarm_message, which writes online log messages to a table called ph_alert in the sysadmin database.

> dbaccess sysadmin -

Database selected.

> select tk_name, tk_description, tk_execute, tk_enable from ph_task where tk_name='post_alarm_message';

tk_name post_alarm_message
tk_description System function to post alerts
tk_execute ph_dbs_alert
tk_enable t

1 row(s) retrieved.

I guess this is there primarily for OAT but it’s extremely useful for system monitoring because it preserves some information about the severity of the alerts and also makes them easy to query via SQL.

> select * from ph_alert where alert_time > current - 30 units minute;

id 9349
alert_task_id 18
alert_task_seq 4970
alert_type INFO
alert_color YELLOW
alert_time 2013-10-02 19:30:06
alert_state NEW
alert_state_chang+ 2013-10-02 19:30:06
alert_object_type ALARM
alert_object_name 23
alert_message Logical Log 14294 Complete, timestamp: 0x7ab0345c.
alert_action_dbs sysadmin
alert_object_info 23001

id 9350
alert_task_id 18
alert_task_seq 4971
alert_type INFO
alert_color YELLOW
alert_time 2013-10-02 19:49:25
alert_state NEW
alert_state_chang+ 2013-10-02 19:49:25
alert_object_type ALARM
alert_object_name 23
alert_message Logical Log 14295 Complete, timestamp: 0x7ad5b988.
alert_action_dbs sysadmin
alert_object_info 23001

2 row(s) retrieved.

Note the alarm types and colours. Personally I don’t set much store by the alert colour; I prefer to go by the alarm type where the possibilities are INFO, WARNING and ERROR. These are not the same as the alarmprogram severities which range from 1 to 5.

For monitoring purposes, a sensible query to look for alerts might be something like:

select alert_time, alert_color, alert_type, alert_object_type, alert_message from ph_alert where alert_type!='INFO' and alert_state='NEW' and alert_time > current - 7 units day order by alert_time;

Using dbaccess or an Informix API for your favourite scripting language, you can monitor for alerts quite easily.
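
For example, something as crude as this, run from cron, would do (mailx and the recipient are placeholders for whatever alerting mechanism you already have):

# Report new WARNING/ERROR alerts from the last day and email them if any are found.
query="select alert_time, alert_type, alert_message
       from ph_alert
       where alert_type != 'INFO' and alert_state = 'NEW'
       and alert_time > current - 1 units day
       order by alert_time;"

output=`echo "$query" | dbaccess sysadmin - 2>/dev/null | grep -E 'WARNING|ERROR'`

if [ -n "$output" ]; then
    echo "$output" | mailx -s "Informix ph_alert warnings/errors" dba@example.com
fi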

You can mark the alerts as acknowledged using a query like the one below:

update ph_alert set alert_state='ACKNOWLEDGED' where id=? and alert_state='NEW' and alert_type in ('WARNING', 'ERROR');

Hopefully I’ve demonstrated that this approach is a lot easier and probably better than a complex regex script. Maybe you shouldn’t throw away that script just yet, though: run both in parallel until you’re satisfied with the reliability of this approach. To make sure that the post_alarm_message process is working, I also check that there are entries in the ph_alert table on a regular basis, although on a quiet system there may not be any entries for some time.
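
That check can be as simple as this (the one-day window is arbitrary):

# Make sure post_alarm_message is still writing rows at all.
echo 'select count(*) from ph_alert where alert_time > current - 1 units day;' | dbaccess sysadmin -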

There’s also the console log to consider, which you may want to monitor; that cannot be done in this way.