Monitoring Informix with the Elastic Stack

Introduction

If you’re not familiar with the Elastic Stack it is a suite of products for ingesting data or logs, searching, analysing and visualising. There is a good overview over at the Elastic web site of how it can be put together. I say “can” because the stack is very flexible and, for example, you can send JSON documents to Elasticsearch via a REST API, rather than use Filebeat or Logstash.

This blog post is mostly concerned with ingesting the Informix online log with Filebeat, recognising certain types of log line that can occur and tagging the file using rules set up in Logstash, before sending it to Elasticsearch for storage and indexing. Finally Kibana can be used to visualise the data stored in Elasticsearch.

It’s easy to see how this could be scaled up to provide a single place to visualise logs from multiple instances and it would be fairly trivial to add in other logs too, like the Informix ON-Bar logs and logs from the operating system.

At IIUG 2018 in Arlington, VA I presented a talk entitled DevOps for DBAs, which demonstrated the Docker set up described below, but at the time I hadn’t documented the full set up. Here it is!

Practical demonstration with Docker containers

Overview

This demonstration sets up two containers: one running Informix Developer Edition and Filebeat to collect and ship logs:

  • Informix 12.10.FC12W1DE, listening on port 9088/tcp for onsoctcp connections.
  • Filebeat 6.5.2.

and the other running the Elasticstack components as follows:

  • Logstash 6.5.2, listening on port 5044/tcp.
  • Elasticsearch 6.5.2, listening on port 9200/tcp.
  • Kibana 6.5.2, listening on port 5601/tcp.

Access to Kibana is via your favourite web browser running on your desktop. Nginx will be listening on port 80 so you can simply access http://localhost/

For a secure production implementation it’s recommended that you use Nginx with HTTPS as a reverse proxy for the Kibana web service as shown in the diagram. We’ll be using Nginx in this demonstration, rather than connecting to Kibana directly, but we won’t be configuring SSL; there are plenty of online guides about how to do this. Also communication between Filebeat and Logstash should be encrypted: this blog post doesn’t cover this.

The above versions are current at the time of writing (December 2018). The Elastic Stack moves very quickly so it is likely these will not be the latest versions by the time you read this. The idea of this blog post is that you should just be able to copy and paste the commands and end up with a working system, but you shouldn’t be surprised if things don’t work perfectly if your versions don’t match the above. For example, in between beginning this blog post and finishing it, version 6.5.x was released with improved default security settings: services now only listen on the loopback interface unless reconfigured.

Running the whole Elasticstack in Docker plus Informix does require a reasonable amount of memory and I’d suggest a minimum of 2.5 GB to be allocated to the Docker Engine.

Docker network

To provide name resolution between the containers we are going to start by creating a docker network:

docker network create --driver bridge my_informix_elk_stack

Elastic Stack installation

Docker container

We’ll start by setting up the Elastic Stack Docker container, which will be based on a (minimal) Debian installation. In a terminal run:

docker pull debian
docker run -it --name elasticstack_monitoring -p 80:80 -p 5044:5044 --hostname elasticstack -e "GF_SECURITY_ADMIN_PASSWORD=secret" --net my_informix_elk_stack debian

Your terminal should now be inside the Docker container and logged in as root.

To avoid some issues with debconf when installing packages run:

echo 'debconf debconf/frontend select Noninteractive' | debconf-set-selections

Install Java

Run these commands to install the software-properties-common package and then OpenJDK 8, which is required by Elasticsearch and Logstash. Java 9 should be fine too.

The Debian Docker image does not come with many packages pre-installed so I am also going to install vim for editing files later plus a few other essentials; you may prefer nano or another editor.

apt-get update
apt-get install software-properties-common gnupg vim wget apt-transport-https openjdk-8-jre

The Debian Docker container is a basic installation so this short list of packages has hundreds of dependencies.

Check the Java version:

# java -version
openjdk version "1.8.0_181"
OpenJDK Runtime Environment (build 1.8.0_181-8u181-b13-2~deb9u1-b13)
OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)

Install Elasticsearch

The Elasticsearch installation is straightforward and follows standard Linux methods for setting up and installing from a third-party software repository.

First we install the repository’s key:

wget -qO - https://packages.elastic.co/GPG-KEY-elasticsearch | apt-key add -

Then add the repository:

echo "deb https://artifacts.elastic.co/packages/6.x/apt stable main" | tee -a /etc/apt/sources.list.d/elastic-6.x.list

The apt-transport-https package we installed earlier allows apt to use this HTTPS repository, so we can now proceed to install:

apt-get update
apt-get install elasticsearch

Elasticsearch will work right out of the box which is fine for the purposes of this demonstration and (after we start the service) will be listening on localhost only on ports 9200 for the REST API and 9300 for node communication.

Install Kibana

This is installed from the Elasticstack repository added above:

apt-get install kibana

Again Kibana doesn’t require any reconfiguration for the purposes of this demonstration and will listen on localhost only on port 5601.

Start Kibana by running:

service kibana start

Now that Kibana is running, start Elasticsearch by running:

service elasticsearch start
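
Elasticsearch can take a little while to start. A quick sanity check, using the wget we installed earlier, is:

wget -qO- http://localhost:9200/

This should return a short JSON document containing the node name and version.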

It’s worth noting at this point that modern Debian distributions use systemd, but this doesn’t work in non-privileged Docker containers. For reference the systemd equivalents for this are:

systemctl daemon-reload
systemctl enable kibana
systemctl enable elasticsearch
systemctl start kibana
systemctl start elasticsearch

These commands also ensure the services start on boot.

As Kibana is only listening on localhost and therefore unreachable from an external web browser, we will set up Nginx as a reverse proxy. This is a more secure configuration and recommended for any production implementation because only Nginx is directly exposed to the internet and not Kibana.

Install Nginx as a reverse proxy

Start by installing Nginx:

apt-get install nginx

Edit the file /etc/nginx/sites-available/default and in the location / section add the line:

proxy_pass http://localhost:5601;

Comment out the line beginning with try_files.

It should look something like this:

    location / {
        # First attempt to serve request as file, then
        # as directory, then fall back to displaying a 404.
        proxy_pass http://localhost:5601;
        #try_files $uri $uri/ =404;
    }

Don’t forget the semi-colon!
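
If you want to check the edit before going any further, Nginx can validate its own configuration:

nginx -t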

Start nginx:

service nginx start

Install Logstash

An installation procedure that by now should look familiar:

apt-get install logstash

Our Logstash configuration will be in two parts:

  1. A standard out of the box configuration for receiving files from Filebeat and sending them to Elasticsearch: for this copy /etc/logstash/logstash-sample.conf to /etc/logstash/conf.d/logstash.conf
  2. A custom config file, /etc/logstash/conf.d/informix.conf, for parsing the Informix online log.

I intend to update and improve the Logstash config for filtering the Informix online log and it’s available on my Github page at https://github.com/skybet/informix-helpers/blob/master/logstash/informix.conf. Download it locally and then you can copy it to your Docker container as follows:

docker cp informix.conf elasticstack_monitoring:/etc/logstash/conf.d/informix.conf

The Informix config file requires that Filebeat tags events with “[fields][informix] = true” (we’ll set this field up in the Filebeat configuration later); this condition is trivial to remove if you wish.
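
The real parsing rules are in the informix.conf file linked above; illustratively, the guard is just a Logstash conditional on that field, something along these lines:

filter {
  if [fields][informix] {
    # Informix-specific grok and date rules live here
  }
}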

Check the Logstash configuration with:

/usr/share/logstash/bin/logstash -t --path.settings /etc/logstash

Finally start Logstash with:

/usr/share/logstash/bin/logstash --path.settings /etc/logstash

You could also use systemd to do this.

Informix installation

Informix Developer Edition Docker container

Now we are going to set up the Informix container to monitor. On your workstation in another terminal run:

$ docker pull ibmcom/informix-developer-database
docker run -it --name iif_developer_edition --privileged -p 9088:9088 -p 9089:9089 -p 27017:27017 -p 27018:27018 -p 27883:27883 --hostname ifxserver --net my_informix_elk_stack -e LICENSE=accept ibmcom/informix-developer-database:latest

It’s worth noting that if you exit the shell the Informix DE Docker container will stop. You can start it again with:

docker start iif_developer_edition -i

This latest version containing 12.10.FC12W1 doesn’t return you to the prompt after the engine starts, so you’ll need to open an interactive container in another terminal window as follows:

docker exec -it iif_developer_edition /bin/bash

Now both Docker containers are running you should be able to test name resolution and connectivity both ways with ping.

From the Informix container:

informix@ifxserver:/$ ping elasticstack_monitoring

From the Elastic Stack container:

root@elasticstack:/# ping iif_developer_edition

These names belong to the containers and are not necessarily their host names.

While you’re still logged in as Informix set MSG_DATE to 1 which is required for my Logstash configuration:

onmode -wf MSG_DATE=1

This means (nearly) all online log messages will be prefixed with the date (MM/DD/YY format) and time (HH:MM:SS format).

You’ll be logged in as user informix which can sudo to root as follows:
sudo -i

Install Filebeat

In the Informix container it’s more of the same to install Filebeat:

apt-get update
apt-get install vim wget apt-transport-https
wget -qO - https://packages.elastic.co/GPG-KEY-elasticsearch | apt-key add -
echo "deb https://artifacts.elastic.co/packages/6.x/apt stable main" | tee -a /etc/apt/sources.list.d/elastic-6.x.list
apt-get update
apt-get install filebeat

Filebeat’s configuration file is /etc/filebeat/filebeat.yml. If you’re not familiar with yml files, correct indentation (two spaces per level) is absolutely essential; yml files rely on this for structure instead of braces or brackets. Lines beginning with a dash indicate an array.

onstat -c | grep MSGPATH reveals that the Informix online log file resides at /opt/ibm/data/logs/online.log and we want Filebeat to ingest this and pass it to Logstash running in the other container for parsing.

The Informix online log quite often contains wrapped lines and these generally don’t start with a timestamp or date of any kind.

Edit the file and add the following configuration directly under filebeat.inputs:

- type: log
  enabled: true
  paths:
    - /opt/ibm/data/logs/online.log
  fields:
    informix: true
  multiline.pattern: '^[0-9][0-9]'
  multiline.negate: true
  multiline.match: after

Finally set up the output by commenting out (adding a ‘#’ to) all lines of the output.elasticsearch section and uncommenting the output.logstash section. Set hosts in this section to ["elasticstack_monitoring:5044"].
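
After the edit the output section of filebeat.yml should end up looking roughly like this:

#output.elasticsearch:
#  hosts: ["localhost:9200"]

output.logstash:
  hosts: ["elasticstack_monitoring:5044"]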

Start filebeat by running:

service filebeat start

You should see the message: Config OK.

Using Kibana

You should now be able to log into Kibana at http://localhost/ and add an index pattern of filebeat-*.

Then in the Discover section you should be able to see Informix log lines coming into the system and how they have been tagged.

If you wish to improve and test the parsing of particular log entries it is simple enough to create ones yourself in the Informix container like this:

echo "12/08/18 17:00:00 My Informix log line" >> /opt/ibm/data/logs/online.log

This particular blog post is going to end shortly. Kibana is a big subject and at this stage of my Elasticstack DevOps journey, I don’t feel qualified to write a blog on the subject. Once I’ve progressed further I may do a follow-on post.

The tagging of events in the online log, like the checkpoint duration and its assignment to the variable informix.ckpt_duration, should allow you to easily search on these and visualise them in dashboards.

Good luck!


Extent size doubling

This post was inspired by a discussion in one of the IIUG Technical Special Interest Groups, titled Outrageous Extent sizes.

When you write to an Informix table or index (referred to collectively as an “object”) the Informix engine writes to an extent, a contiguous allocation of disk of a fixed size dedicated to that table or index. Informix has a limit on the number of extents an object may have, which depends on the page size: around 200 extents for a 2 kB dbspace rising to around 2000 for a 16 kB dbspace.

To ensure space is not wasted new extents start small and then the engine increases the space it allocates as the table grows.

In versions of Informix now unsupported the algorithm was to allocate four extents of the initial extent size and then start using the next extent size. The next extent size doubled every sixteen allocations. Unfortunately this algorithm was slow to “get going” and objects, especially those in 2 kB dbspaces, could quickly gain a lot of extents and approach the extent limit.

To avoid this problem the DBA could modify the first and next extent sizes to suit, which was useful especially if you could anticipate future growth.

In recent versions of Informix the algorithm for allocating extents to tables has changed and become more aggressive, the result being less intervention required by a DBA but perhaps unwanted surprises when large amounts of unallocated space are assigned to objects whenever a new extent is created.

The Informix manual states with regard to extent size doubling:

For permanent tables or user-defined temporary tables, the size of the next extent for every allocation is automatically doubled. The size doubles up to 128 kilobytes (KB). For example, if you create a table with the NEXT SIZE equal to 15 KB, the database server allocates the first extent at a size of 15 KB. The next extent is allocated at 30 KB, and the extent after that is allocated at 60 KB. When the extent size reaches 128 KB, the size is doubled only when the remaining space in the table is less than 10% of the total allocated space in the table.

For system-created temporary tables, the next-extent size begins to double after 4 extents have been added.

I found this explanation kind of confusing and – as we’ll see – it’s slightly wrong anyway.

As ever I am going to use Perl, a DBA favourite, to investigate what is happening. I am going to create a simple table as follows:

create table t1 (
myid bigserial not null,
mydata varchar(254) not null
) in my4kspace extent size 16 next size 16 ;

create unique index ui_t1 on t1 (myid) in my4kspace;

Then what I am going to do is insert random data one row at a time into the mydata column and, after every insert, check the next extent size used by referencing the partnum in sysmaster:sysptnhdr. If a new extent is added or the next extent size changes I’m going to print a one-line summary and do this for both the table and its index.

It’s worth mentioning that sysptnhdr records the actual extent sizes being used, not systables, which contains the value last set using SQL or when the table was created.
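
For reference, the check the script performs after each insert boils down to a query of this sort (a sketch with column names from sysmaster:sysptnhdr; the full Perl script is in the repository linked at the end of this post):

SELECT p.nextsiz, p.nptotal, p.nrows
FROM sysmaster:sysptnhdr p, systables t
WHERE t.tabname = 't1'
AND t.partnum = p.partnum;

nextsiz and nptotal are in pages; the index is checked the same way using its own partnum.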

Here are my results:

TABLE: Next size: 16 kB (4 pages); nptotal: 4 (16 kB); nrows: 1
INDEX: Next size: 16 kB (4 pages); nptotal: 4 (16 kB)
TABLE: Next size: 32 kB (8 pages); nptotal: 8 (32 kB); nrows: 98; ratio: 100.00 (16 kB -> 32 kB)
TABLE: Next size: 64 kB (16 pages); nptotal: 16 (64 kB); nrows: 207; ratio: 100.00 (32 kB -> 64 kB)
TABLE: Next size: 128 kB (32 pages); nptotal: 20 (80 kB); nrows: 424; ratio: 100.00 (64 kB -> 128 kB)
INDEX: Next size: 32 kB (8 pages); nptotal: 8 (32 kB); ratio: 100.00 (16 kB -> 32 kB)
TABLE: Next size: 256 kB (64 pages); nptotal: 52 (208 kB); nrows: 529; ratio: 160.00 (128 kB -> 256 kB)
TABLE: Next size: 512 kB (128 pages); nptotal: 116 (464 kB); nrows: 1402; ratio: 123.08 (256 kB -> 512 kB)
INDEX: Next size: 64 kB (16 pages); nptotal: 16 (64 kB); ratio: 100.00 (32 kB -> 64 kB)
TABLE: Next size: 512 kB (128 pages); nptotal: 244 (976 kB); nrows: 3157; ratio: 110.34
INDEX: Next size: 128 kB (32 pages); nptotal: 32 (128 kB); ratio: 100.00 (64 kB -> 128 kB)
TABLE: Next size: 512 kB (128 pages); nptotal: 372 (1488 kB); nrows: 6636; ratio: 52.46
INDEX: Next size: 256 kB (64 pages); nptotal: 64 (256 kB); ratio: 100.00 (128 kB -> 256 kB)
TABLE: Next size: 512 kB (128 pages); nptotal: 500 (2000 kB); nrows: 10162; ratio: 34.41
TABLE: Next size: 512 kB (128 pages); nptotal: 628 (2512 kB); nrows: 13697; ratio: 25.60
INDEX: Next size: 512 kB (128 pages); nptotal: 128 (512 kB); ratio: 100.00 (256 kB -> 512 kB)
TABLE: Next size: 512 kB (128 pages); nptotal: 756 (3024 kB); nrows: 17255; ratio: 20.38
TABLE: Next size: 512 kB (128 pages); nptotal: 884 (3536 kB); nrows: 20741; ratio: 16.93
TABLE: Next size: 512 kB (128 pages); nptotal: 1012 (4048 kB); nrows: 24332; ratio: 14.48
TABLE: Next size: 512 kB (128 pages); nptotal: 1140 (4560 kB); nrows: 27883; ratio: 12.65
INDEX: Next size: 512 kB (128 pages); nptotal: 256 (1024 kB); ratio: 100.00
TABLE: Next size: 512 kB (128 pages); nptotal: 1268 (5072 kB); nrows: 31463; ratio: 11.23
TABLE: Next size: 512 kB (128 pages); nptotal: 1396 (5584 kB); nrows: 34921; ratio: 10.09
TABLE: Next size: 1024 kB (256 pages); nptotal: 1524 (6096 kB); nrows: 38471; ratio: 9.17 (512 kB -> 1024 kB)
TABLE: Next size: 1024 kB (256 pages); nptotal: 1780 (7120 kB); nrows: 41965; ratio: 16.80
TABLE: Next size: 1024 kB (256 pages); nptotal: 2036 (8144 kB); nrows: 49007; ratio: 14.38
TABLE: Next size: 1024 kB (256 pages); nptotal: 2292 (9168 kB); nrows: 56089; ratio: 12.57
INDEX: Next size: 512 kB (128 pages); nptotal: 384 (1536 kB); ratio: 50.00
TABLE: Next size: 1024 kB (256 pages); nptotal: 2548 (10192 kB); nrows: 63151; ratio: 11.17
TABLE: Next size: 1024 kB (256 pages); nptotal: 2804 (11216 kB); nrows: 70325; ratio: 10.05
TABLE: Next size: 2048 kB (512 pages); nptotal: 3060 (12240 kB); nrows: 77402; ratio: 9.13 (1024 kB -> 2048 kB)
TABLE: Next size: 2048 kB (512 pages); nptotal: 3572 (14288 kB); nrows: 84473; ratio: 16.73
INDEX: Next size: 512 kB (128 pages); nptotal: 512 (2048 kB); ratio: 33.33
TABLE: Next size: 2048 kB (512 pages); nptotal: 4084 (16336 kB); nrows: 98674; ratio: 14.33
TABLE: Next size: 2048 kB (512 pages); nptotal: 4596 (18384 kB); nrows: 112742; ratio: 12.54
INDEX: Next size: 512 kB (128 pages); nptotal: 640 (2560 kB); ratio: 25.00
TABLE: Next size: 2048 kB (512 pages); nptotal: 5108 (20432 kB); nrows: 126934; ratio: 11.14
TABLE: Next size: 2048 kB (512 pages); nptotal: 5620 (22480 kB); nrows: 141188; ratio: 10.02
INDEX: Next size: 512 kB (128 pages); nptotal: 768 (3072 kB); ratio: 20.00
TABLE: Next size: 4096 kB (1024 pages); nptotal: 6132 (24528 kB); nrows: 155312; ratio: 9.11 (2048 kB -> 4096 kB)
TABLE: Next size: 4096 kB (1024 pages); nptotal: 7156 (28624 kB); nrows: 169379; ratio: 16.70
INDEX: Next size: 512 kB (128 pages); nptotal: 896 (3584 kB); ratio: 16.67
TABLE: Next size: 4096 kB (1024 pages); nptotal: 8180 (32720 kB); nrows: 197862; ratio: 14.31
INDEX: Next size: 512 kB (128 pages); nptotal: 1024 (4096 kB); ratio: 14.29
TABLE: Next size: 4096 kB (1024 pages); nptotal: 9204 (36816 kB); nrows: 226153; ratio: 12.52
INDEX: Next size: 512 kB (128 pages); nptotal: 1152 (4608 kB); ratio: 12.50
TABLE: Next size: 4096 kB (1024 pages); nptotal: 10228 (40912 kB); nrows: 254610; ratio: 11.13
INDEX: Next size: 512 kB (128 pages); nptotal: 1280 (5120 kB); ratio: 11.11
TABLE: Next size: 4096 kB (1024 pages); nptotal: 11252 (45008 kB); nrows: 282889; ratio: 10.01
INDEX: Next size: 512 kB (128 pages); nptotal: 1408 (5632 kB); ratio: 10.00
TABLE: Next size: 8192 kB (2048 pages); nptotal: 12276 (49104 kB); nrows: 311209; ratio: 9.10 (4096 kB -> 8192 kB)
INDEX: Next size: 1024 kB (256 pages); nptotal: 1536 (6144 kB); ratio: 9.09 (512 kB -> 1024 kB)
TABLE: Next size: 8192 kB (2048 pages); nptotal: 14324 (57296 kB); nrows: 339501; ratio: 16.68
INDEX: Next size: 1024 kB (256 pages); nptotal: 1543 (6172 kB); ratio: 16.67
INDEX: Next size: 1024 kB (256 pages); nptotal: 1799 (7196 kB); ratio: 16.59

What I observe here for this table in a 4 kB space with an index also in a 4 kB space is:

  • The initial extent is created using the first extent size.
  • The second extent is created using the next extent size.
  • The next extent size doubles every time the table needs more space up to 128 pages, not kB as stated in the manual.
  • The extent size then doubles if the next extent size is less than 10% of the size of the object.
  • Identical logic applies to both tables and indices.

It’s easy to get confused between kB and pages here which is why I am showing both.
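
Putting those observations together, my reading of the sizing rule is roughly this (a sketch in Perl, working in pages):

# $next is the current next-extent size and $nptotal the object's size, both in pages.
sub next_extent_pages {
    my ($next, $nptotal) = @_;
    return $next * 2 if $next < 128;              # doubles on every allocation up to 128 pages
    return $next * 2 if $next / $nptotal < 0.10;  # then doubles only when under 10% of the object
    return $next;
}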

Just to show the 128 pages works for other page sizes, here is the same output using an 8 kB dbspace:

TABLE: Next size: 32 kB (4 pages); nptotal: 4 (32 kB); nrows: 1
INDEX: Next size: 32 kB (4 pages); nptotal: 4 (32 kB)
TABLE: Next size: 64 kB (8 pages); nptotal: 8 (64 kB); nrows: 172; ratio: 100.00 (32 kB -> 64 kB)
TABLE: Next size: 128 kB (16 pages); nptotal: 16 (128 kB); nrows: 410; ratio: 100.00 (64 kB -> 128 kB)
TABLE: Next size: 256 kB (32 pages); nptotal: 32 (256 kB); nrows: 877; ratio: 100.00 (128 kB -> 256 kB)
INDEX: Next size: 64 kB (8 pages); nptotal: 8 (64 kB); ratio: 100.00 (32 kB -> 64 kB)
TABLE: Next size: 512 kB (64 pages); nptotal: 64 (512 kB); nrows: 1776; ratio: 100.00 (256 kB -> 512 kB)
INDEX: Next size: 128 kB (16 pages); nptotal: 16 (128 kB); ratio: 100.00 (64 kB -> 128 kB)
TABLE: Next size: 1024 kB (128 pages); nptotal: 128 (1024 kB); nrows: 3570; ratio: 100.00 (512 kB -> 1024 kB)
INDEX: Next size: 256 kB (32 pages); nptotal: 32 (256 kB); ratio: 100.00 (128 kB -> 256 kB)
TABLE: Next size: 1024 kB (128 pages); nptotal: 256 (2048 kB); nrows: 7214; ratio: 100.00
INDEX: Next size: 512 kB (64 pages); nptotal: 64 (512 kB); ratio: 100.00 (256 kB -> 512 kB)
TABLE: Next size: 1024 kB (128 pages); nptotal: 384 (3072 kB); nrows: 14466; ratio: 50.00
TABLE: Next size: 1024 kB (128 pages); nptotal: 512 (4096 kB); nrows: 21756; ratio: 33.33
TABLE: Next size: 1024 kB (128 pages); nptotal: 640 (5120 kB); nrows: 29041; ratio: 25.00
INDEX: Next size: 1024 kB (128 pages); nptotal: 128 (1024 kB); ratio: 100.00 (512 kB -> 1024 kB)
TABLE: Next size: 1024 kB (128 pages); nptotal: 768 (6144 kB); nrows: 36175; ratio: 20.00
TABLE: Next size: 1024 kB (128 pages); nptotal: 896 (7168 kB); nrows: 43515; ratio: 16.67
TABLE: Next size: 1024 kB (128 pages); nptotal: 1024 (8192 kB); nrows: 50751; ratio: 14.29
TABLE: Next size: 1024 kB (128 pages); nptotal: 1152 (9216 kB); nrows: 58019; ratio: 12.50
INDEX: Next size: 1024 kB (128 pages); nptotal: 256 (2048 kB); ratio: 100.00
TABLE: Next size: 1024 kB (128 pages); nptotal: 1280 (10240 kB); nrows: 65372; ratio: 11.11
TABLE: Next size: 1024 kB (128 pages); nptotal: 1408 (11264 kB); nrows: 72545; ratio: 10.00
TABLE: Next size: 2048 kB (256 pages); nptotal: 1536 (12288 kB); nrows: 79871; ratio: 9.09 (1024 kB -> 2048 kB)
TABLE: Next size: 2048 kB (256 pages); nptotal: 1792 (14336 kB); nrows: 87169; ratio: 16.67
TABLE: Next size: 2048 kB (256 pages); nptotal: 2048 (16384 kB); nrows: 101742; ratio: 14.29
TABLE: Next size: 2048 kB (256 pages); nptotal: 2304 (18432 kB); nrows: 116352; ratio: 12.50
INDEX: Next size: 1024 kB (128 pages); nptotal: 278 (2224 kB); ratio: 50.00
TABLE: Next size: 2048 kB (256 pages); nptotal: 2560 (20480 kB); nrows: 130862; ratio: 11.11
INDEX: Next size: 1024 kB (128 pages); nptotal: 406 (3248 kB); ratio: 46.04
TABLE: Next size: 2048 kB (256 pages); nptotal: 2816 (22528 kB); nrows: 145461; ratio: 10.00
TABLE: Next size: 4096 kB (512 pages); nptotal: 3072 (24576 kB); nrows: 160073; ratio: 9.09 (2048 kB -> 4096 kB)
TABLE: Next size: 4096 kB (512 pages); nptotal: 3584 (28672 kB); nrows: 174651; ratio: 16.67
INDEX: Next size: 1024 kB (128 pages); nptotal: 534 (4272 kB); ratio: 31.53
TABLE: Next size: 4096 kB (512 pages); nptotal: 4096 (32768 kB); nrows: 203750; ratio: 14.29
TABLE: Next size: 4096 kB (512 pages); nptotal: 4608 (36864 kB); nrows: 232818; ratio: 12.50
INDEX: Next size: 1024 kB (128 pages); nptotal: 662 (5296 kB); ratio: 23.97
TABLE: Next size: 4096 kB (512 pages); nptotal: 5120 (40960 kB); nrows: 261920; ratio: 11.11
TABLE: Next size: 4096 kB (512 pages); nptotal: 5632 (45056 kB); nrows: 290967; ratio: 10.00
INDEX: Next size: 1024 kB (128 pages); nptotal: 790 (6320 kB); ratio: 19.34
TABLE: Next size: 8192 kB (1024 pages); nptotal: 6144 (49152 kB); nrows: 320100; ratio: 9.09 (4096 kB -> 8192 kB)
TABLE: Next size: 8192 kB (1024 pages); nptotal: 7168 (57344 kB); nrows: 349442; ratio: 16.67
INDEX: Next size: 1024 kB (128 pages); nptotal: 918 (7344 kB); ratio: 16.20
TABLE: Next size: 8192 kB (1024 pages); nptotal: 8192 (65536 kB); nrows: 407578; ratio: 14.29
INDEX: Next size: 1024 kB (128 pages); nptotal: 1046 (8368 kB); ratio: 13.94
TABLE: Next size: 8192 kB (1024 pages); nptotal: 9216 (73728 kB); nrows: 465592; ratio: 12.50
INDEX: Next size: 1024 kB (128 pages); nptotal: 1138 (9104 kB); ratio: 12.24
TABLE: Next size: 8192 kB (1024 pages); nptotal: 10240 (81920 kB); nrows: 523746; ratio: 11.11
INDEX: Next size: 1024 kB (128 pages); nptotal: 1266 (10128 kB); ratio: 11.25
TABLE: Next size: 8192 kB (1024 pages); nptotal: 11264 (90112 kB); nrows: 581740; ratio: 10.00
INDEX: Next size: 1024 kB (128 pages); nptotal: 1394 (11152 kB); ratio: 10.11
TABLE: Next size: 16384 kB (2048 pages); nptotal: 12288 (98304 kB); nrows: 639785; ratio: 9.09 (8192 kB -> 16384 kB)
INDEX: Next size: 2048 kB (256 pages); nptotal: 1522 (12176 kB); ratio: 9.18 (1024 kB -> 2048 kB)
TABLE: Next size: 16384 kB (2048 pages); nptotal: 14336 (114688 kB); nrows: 697932; ratio: 16.67

Another feature of the engine is merging adjacent extents for the same object. This excerpt from an oncheck -pe report is from the 8 kB example:

Description Offset(p) Size(p)
------------------------------------------------------------- -------- --------
dbname:'benthompson'.t1 187897 14336
dbname:'benthompson'.ui_t1 221782 278
dbname:'benthompson'.ui_t1 223004 860
dbname:'benthompson'.ui_t1 234148 384

In my example the engine has managed to merge all the extents for the table into one and has merged the index extents into three (278 + 860 + 384 = 1522).

If you want to try out the test yourself the code is in the skybet/informix-helpers github repository.


Monitoring Informix with Grafana

Introduction

In a presentation I gave at IIUG 2017 titled Making system monitoring better I showed, without much detail, how Grafana is a powerful tool for visualising what is happening within your Informix server. Performance metrics from your database server are collected at regular (usually 10 second) intervals and stored in a time-series database which can be used as the source for dashboards containing dynamic graphs and other ways of presenting the data. For the DBA the benefits are legion:

  • Quickly change the time range to zoom into when problems occurred or zoom out to see trends.
  • Correlate various database metrics and combine them with related operating system, network, storage or application metrics.
  • Get a truer picture of your busy periods, capacity, the effect of scheduled jobs etc.
  • Faster problem resolution because a lot of data can be visualised at once.

You might be able to think of some others.

The talk also touched on the CAMS acronym:

Culture
Automation
Measurement
Sharing

So you shouldn’t keep your dashboards to yourself: share them with other technical teams or everyone in your company. This has the added benefit of more eyes and others can learn to spot database problems, or when they are probably not database problems, by referring to these.

Why Grafana?

There are a number of tools which appear to do a similar job, among them Brunia, AGS Sentinel, Prometheus, Graphite and Grafana.

You perhaps haven’t heard of Brunia: it is the code name for a prototype monitoring tool that may replace Informix Open Admin Tool (OAT) in the future. It was demonstrated at IIUG 2017 and is probably closest to Prometheus in its execution. AGS Sentinel is the monitoring add-on to the popular ServerStudio suite for Informix. The rest are popular open source tools which other teams in your organisation are probably already using.

Some of the tools listed above can also produce events or alerts when a trigger condition occurs and automatically pass this up a stack to PagerDuty or another call-out system. An example of such a stack is Prometheus -> Alertmanager -> PagerDuty -> StatusPage

There are a lot of ways of implementing a full monitoring stack with choices to make about data collection, storing, visualisation, analysis and alerting. In this blog post I am going to concentrate on a simple set up where we collect Informix metrics, pass them to InfluxDB using a REST API and visualise in Grafana. For a fuller discussion of the benefits of the three open source technologies mentioned above I highly recommend reading this blog post from Loom Systems written in June 2017, Prometheus vs. Grafana vs. Graphite – A Feature Comparison.

In case you’re not going to read the LS blog it’s worth emphasising what the InfluxDB/Grafana solution I am about to describe does not provide:

  • There is little in the way of monitoring and alerting features.
  • Regression models enabling you to predict the value of a metric in the future are not available.
  • Advanced time series functions are not available.

The solution would be richer if Graphite was used as the data source and Grafana for visualisation only. This would provide more aggregation functions and allow you to do things like subtract one time series from another. As an example of what this might provide, I have a dashboard (not covered in this blog post) displaying the buffer turnover ratio and buffer waits ratio over an arbitrary moving window, irrespective of when onstat -z was last run.

It is easy to confuse Graphite and Grafana, especially as both can be used independently, or Graphite can be a data source for Grafana.

As this is an Informix blog I ought to explain why I am using InfluxDB and not Informix TimeSeries. The simple answer is that to use Informix TimeSeries with Grafana properly someone would have to write and maintain a data source plugin for it like the one for InfluxDB we’ll be using. Doing so would give something more feature rich than InfluxDB for sure but perhaps not much more powerful than a Graphite/Grafana combination.

What to monitor

Potentially anything we can put a value to every ten seconds can be collected and stored in InfluxDB (which is a statement you can make about time series collections in general). For Linux operating system metrics there is a well-established collection daemon called collectd and, if I had better C programming skills, I could write a collectd plugin for Informix.

For Informix systems the most obvious source is the system monitoring interface (SMI) which is the presentation of information held in shared memory through pseudo-tables in the sysmaster database. This covers the vast majority of what can be collected using onstat but is easier to handle in a programming language. Doing it this way means we can also collect real table data in the same manner.

For example the system profile or onstat -p can be captured with the following SQL:

SELECT
TRIM(name) AS name,
value
FROM
sysmaster:sysprofile
WHERE
name IN ('dskreads', 'bufreads', 'dskwrites', 'bufwrites', 'isamtot', 'isopens', 'isstarts', 'isreads', 'iswrites', 'isrewrites', 'isdeletes', 'iscommits', 'isrollbacks', 'latchwts', 'buffwts', 'lockreqs', 'lockwts', 'ckptwts', 'deadlks', 'lktouts', 'numckpts', 'plgpagewrites', 'plgwrites', 'llgrecs', 'llgpagewrites', 'llgwrites', 'pagreads', 'pagwrites', 'flushes', 'compress', 'fgwrites', 'lruwrites', 'chunkwrites', 'btraidx', 'dpra', 'rapgs_used', 'seqscans', 'totalsorts', 'memsorts', 'disksorts', 'maxsortspace')

It’s important to understand that all of these metrics are what I’d term counters. That is, they only increase over time (unless they get so large they run out of bits and wrap, or a DBA runs onstat -z). It gets difficult to see on a graph the difference between, say, 2394472 and 2394483, so it’s useful to calculate a delta over the ten-second window. Some things you might collect are automatically suitable for graphing because they are gauges: an example of this is the number of threads in your ready queue at any given moment.
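
As a minimal sketch (not the real collector, which we’ll get to below), the delta calculation in Perl is just:

# Keep the previous sample for each counter and post only the difference.
my %last;

sub delta {
    my ($name, $value) = @_;
    my $d = exists $last{$name} ? $value - $last{$name} : 0;
    $d = 0 if $d < 0;    # counter wrapped or was reset, e.g. by onstat -z
    $last{$name} = $value;
    return $d;
}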

Implementation

Practical demonstration with Docker containers

Nothing better than an example you can try at home (or work!). In the implementation example I will be using the IBM Informix Developer Edition Docker container which, at the time of writing, runs Debian 8 (jessie) and a second Docker container for InfluxDB and Grafana. You’ll of course need Docker installed on your laptop or workstation for this to work.

What this demonstration is going to build is straightforward: a collector script will gather metrics from Informix at a regular interval and post the results to InfluxDB, and you will be able to use your usual web browser to connect to Grafana and visualise the data. Sounds simple?

We’ll start by setting up the InfluxDB/Grafana Docker container, which will also be based on a (minimal) Debian installation. In a terminal run:

docker pull debian
docker run -it --name influx_grafana_monitoring -p 8086:8086 -p 3000:3000 --hostname grafserv -e "GF_SECURITY_ADMIN_PASSWORD=secret" debian

Your terminal should now be inside the Docker container and logged in as root. Run these commands to install some extra packages and then InfluxDB:

apt-get update
apt-get -y install curl gnupg apt-transport-https procps
curl -sL https://repos.influxdata.com/influxdb.key | apt-key add -
echo "deb https://repos.influxdata.com/debian jessie stable" | tee -a /etc/apt/sources.list
apt-get update
apt-get -y install influxdb

Next install Grafana in the container:

echo "deb https://packagecloud.io/grafana/stable/debian/ jessie main" | tee -a /etc/apt/sources.list
curl https://packagecloud.io/gpg.key | apt-key add -
apt-get update
apt-get -y install grafana

Start both services inside the container:

/etc/init.d/influxdb start
/etc/init.d/grafana-server start

We need to create an Influx database to store our time series data and we can do this with a REST API call:

curl -i -XPOST http://localhost:8086/query --data-urlencode "q=CREATE DATABASE informix"

If it works you should see an HTTP/1.1 200 OK response.

You should now be able to access the Grafana server running in your Docker container from your usual web browser at http://localhost:3000/

Log in with the user name admin and the password secret. Once logged in click Add data source and fill in the settings as follows (some of them are case-sensitive):

Name: informix
Type: InfluxDB

HTTP settings

URL: http://localhost:8086
Access: direct

HTTP Auth

Basic auth: Leave unticked
With credentials: Leave unticked

InfluxDB Details

Database: informix
User: Leave blank
Password: Leave blank
Min time interval: Leave at 10s

All being well you should see Data source is working in a big green box.

Now we are going to set up the Informix container to monitor. On your workstation in another terminal run:

$ docker pull ibmcom/informix-developer-database
docker run -it --name iif_developer_edition --privileged -p 9088:9088 -p 9089:9089 -p 27017:27017 -p 27018:27018 -p 27883:27883 --hostname ifxserver -e LICENSE=accept ibmcom/informix-developer-database:latest

The command above should provide a running Informix instance which may take a few moments after which control is passed back to the terminal. We are now going to get the scripts that will send data to InfluxDB:

sudo -i
apt-get update
apt-get -y install git libdbi-perl libjson-perl libwww-curl-perl make gcc libtest-pod-perl

We need to get the Perl DBD::Informix package from CPAN which will download, compile, test and install it for us.

. /home/informix/ifx_dev.env
export DBI_DBNAME=sysmaster
export DBD_INFORMIX_DATABASE=sysmaster
export DBD_INFORMIX_USERNAME=informix
export DBD_INFORMIX_PASSWORD=in4mix
cpan

Enter ‘yes’ to configure as much as possible. In the CPAN console type the case-sensitive command:

install DBD::Informix

There is quite a lot that could go wrong in the CPAN console but it should work if you’re using the IBM Informix DE Docker container and follow the steps exactly. If you’re installing on RedHat Linux or a derivative the required RPM package names you use with yum install will all be different.

Type logout to exit the root shell. You should be logged in as user informix again. Leave this session for later.

Run on your local workstation (outside both Docker containers) in another terminal window:

git clone https://github.com/skybet/informix-helpers.git

This assumes you have git installed. There are two important files in the post_to_influxdb directory:

  • informix_smi_influx_uploader
  • informix_smi.json

You will need to edit informix_smi.json and change all references from mydatabase to the name of the user/application database you want to monitor. For the purposes of this demo we are just going to monitor the sysmaster database, so change all mydatabase references to sysmaster.
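
A quick way to make that change on your workstation before copying the file across (GNU sed syntax):

sed -i 's/mydatabase/sysmaster/g' informix_smi.json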

You can copy the files to your Informix Docker container as follows. To get the name of your Informix Docker container (not its hostname) simply type docker ps on your workstation.

$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
5a9c73712429 debian "bash" 8 minutes ago Up 8 minutes 0.0.0.0:3000->3000/tcp, 0.0.0.0:8086->8086/tcp influx_grafana_monitoring
e1c178b57164 ibmcom/informix-developer-database:latest "/bin/bash informi..." 13 minutes ago Up 13 minutes 0.0.0.0:9088-9089->9088-9089/tcp, 0.0.0.0:27017-27018->27017-27018/tcp, 0.0.0.0:27883->27883/tcp iif_developer_edition

From the above my Informix container name is e1c178b57164

docker cp informix_smi_influx_uploader e1c178b57164:/home/informix
docker cp informix_smi.json e1c178b57164:/home/informix

We should be ready to start collecting metrics and posting them to InfluxDB. Run in the Informix container as user informix:

cd
./informix_smi_influx_uploader -c ./informix_smi.json -i 10 -u http://other_container_ip:8086/write?db=informix

Change other_container_ip to the IP address of your InfluxDB/Grafana container. You must use the IP address unless you have name resolution, which this basic Docker set up does not provide. If you don’t know what this is you can find it by pinging the container’s own hostname from inside the InfluxDB/Grafana container, e.g. ping -c 1 grafserv.
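
Alternatively you can ask Docker for the address directly from your workstation:

docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' influx_grafana_monitoring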

All being well the Perl script should run continuously and collect and post data to InfluxDB every 10 seconds or whatever interval you specified with the -i switch.

To see anything in Grafana we’ll need to set up a dashboard. The file informix_server.json in the grafana_dashboard directory describes a Grafana dashboard. You’ll need to edit it a bit first and change all occurrences of the following:

<%= @hostname %>: the hostname of your Informix Docker container, normally the unqualified hostname of your Informix server
<%= @informixserver %>: the name of your Informix instance, dev

In the Grafana web browser click the Grafana logo and then Dashboards. Click Import. At Import Dashboard click Upload .json File. Hey presto, you should have a dashboard with graphs. Some may not display any data, e.g. Temporary dbspace usage, because there are no temporary dbspaces in the development Docker image by default.

Making this ready for production

There are a few bits I’ll leave to you for any production implementation:

  • The collection shouldn’t run as user informix. Create a user for the monitoring and give it just the CONNECT and SELECT privileges it needs.
  • You’ll also need to write a script to start/stop the collection with the instance.
  • Linux operating system statistics, gathered through collectd would complement the dashboard very nicely.
  • You’ll probably want to customise both the JSON file containing the SMI queries and the one describing the dashboard. You could add application metrics that can be collected from the database or create multiple dashboards, especially if you don’t like the idea of one big one showing everything. Bear in mind any queries you write need to be fast and will run every 10 seconds.
  • In the dashboard you may need to add/remove different page sizes to/from the buffer, page and disk reads/writes graphs.

Conclusion

It can be very useful to track virtual segment and temp. space usage on your server and correlate with events like update stats, ontape/onbar backups or application activity. You can use other tools to do this but these often are not as accessible or are purely in the realm of the DBA. A Grafana dashboard like the one described here should be very useful for you and colleagues, especially if they have their own dashboards on the same system which allow you to view your systems as a whole, and it might go some distance to demystifying Informix in your organisation.


Auditing and onaudit

Compliance is one of those things you can hardly ignore as a DBA these days. Whether it’s a PCI-DSS, financial or internal best practice audit, at some point someone is going to ask you whether you are using database auditing. In my experience the auditors struggle to ask Informix specific questions but this is one that always comes up.

I guess there are three answers to this question:

  • Yes, we use database auditing.
  • No we don’t use Informix auditing but we have a third party solution somewhere else in the stack that means someone else worries about it.
  • Can we have a compensating control, please?

Rarely do I find that auditors are concerned about the detail of what you’re actually auditing. If you can log in, do some stuff and show them that this resulted in some messages in the audit log, they are usually happy. They are usually more concerned about where the logs go, who can read them and so on.

While the auditors are clearly not Informix DBAs familiar with all the auditing mnemonics, they are not daft: they know they can take for granted next year most of what they asked for this time and ask more probing questions then.

So should you look at onaudit for your requirements? It’s been around a long time but I expect it may see a pick up in interest as more and more systems take payments in one way or another. In some ways it could do with some updates. Integration with syslog, allowing easy upload to a centralised system, is needed. There is an RFE open for this (id 58678). It’s not mine but it had six votes when I last checked and it deserves more!

Positives about onaudit include:

  • It’s free with all editions.
  • Provided you stay away from selective row auditing (I don’t cover this in this blog post) and don’t try to audit much or any of what your application does the overhead is negligible.
  • It gives you as a DBA a clearer idea of what is happening on your system.

So I think it’s certainly worthy of consideration. I know some customers prefer security solutions external to the database like Guardium but these are costly. I don’t know much about them so I shall leave that thought there.

Auditing needs to be part of a more general secure framework. If everyone at your site logs in as user informix or any shared account, the worst case being the same account as your application, it’s not going to be as useful. Applying rules by user will be difficult or impossible.

Some sites I’ve seen let DBAs do all their work as user informix. It definitely saves developing a more secure framework for DBAs to work in (this is not a good thing!) but has disadvantages. Even if you avoid shared passwords by using sudo to informix (on UNIX) after logging in as yourself, you’d then need to cross-check with the secure logs on your server to see who it was, and if two people have escalated privileges at the same time it can be tricky to distinguish their actions. Ideally you need DBAs and every other real person working under their own user ids as much as possible.

To work as a DBA without access to the informix account you simply add yourself to the same group as the group owning the $INFORMIXDIR/etc folder and grant yourself dba in any databases you need to do DDL in, plus sysmaster, sysadmin, sysutils, sysha and sysuser but it still presents the following challenges which may require specific sudo type solutions:

  • Starting an instance; stopping one is no problem.
  • Running xtrace and certain oncheck commands.

Additionally as a DBA you may need root access occasionally for installations, upgrades and to use debuggers.

So before you even start there are some highly desirable prerequisites:

  • Your applications use their own account or (ideally) accounts and real users cannot run ad-hoc sessions using these.
  • Real users don’t use shared accounts (and especially not shared passwords). This means locking down the informix account.
  • DBAs practise what they preach and administer the system under their own accounts as much as possible.

Getting this far can be a struggle but even if you’re only some of the way there, you can still proceed.

The next step is consider whether to install Informix with role separation. I’m not going to discuss this at length so I’ll point to the documentation. There are no real gotchas here: it works pretty much as it says on the tin. The key idea is that it separates the DBAs from the people who decide what should be audited and who can see the audit trail. In practice I think total separation is impossible because the people deciding what should be audited need to understand the impact on the system of what they audit and the volume of data this produces. It is certainly possible to slow a system down by auditing every update.

So you’re now ready to switch on auditing? Nearly. If you monitor your system via onstat or have scripts which call onmode (‘onmode -c [un]block’ being a specific example where care is required), you need to be aware that in all but the latest Informix releases (right up to and including 12.10.FC5W1), as soon as you switch on auditing your onstat and onmode commands will run more slowly. This can also affect admin API command equivalents, not just the ones which are direct equivalents for onmode. The situation can get quite bad when lots of these commands run at the same time, leading to significant delays in the response from these commands.

Fortunately there are some fixes for this:

IC89645
TURNING ON THE AUDITING LEVEL 1 ADDS AN UNNECESSARY DELAY TO ONSTAT AND ONMODE COMMANDS

This has been around for a while and appeared in 11.70.FC7W1. However it is not very effective and only eliminates the delay if the volume of onstat commands being run on your system is low.

IT08812
TURNING ON THE AUDITING LEVEL 1 ADDS AN UNNECESSARY DELAY TO ONSTAT & ONMODE COMMANDS

This is completely effective and means that onstat and onmode behave identically to when auditing is switched off but it only works if you do not have any audit masks which log the use of these commands.

There are workarounds for the auditing delay such as using sysmaster equivalents for the onstat commands and performing onstat commands inside an onstat -i interactive session.

Finally you’ll want to consider setting up some audit masks. I take the following approach to this:

_require mask
This mask defines the minimum events to be audited for all users. I put everything that’s common in here.

_default mask
If an account is not assigned a specific mask, it will pick up all the events in there. To avoid having to assign masks to all real users, I don’t assign them any mask and then they automatically inherit this one (in addition to what is in the _require mask).

Other masks
For my applications and other accounts needing special treatment, I create a custom mask and assign it to the user.
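
Masks are created and maintained with onaudit itself. Purely as a sketch (the event mnemonics here are examples only; check them against the security documentation and build your own lists):

onaudit -a -u _require -e +CRDB,DRDB,GRDB,RVDB   # minimum events audited for everyone
onaudit -a -u _default -e +CRTB,DRTB,ALTB        # inherited by users with no specific mask
onaudit -a -u appuser  -e +DRTB                  # custom mask for a particular account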

Finally if you’re feeling brave switch auditing on with some commands like:

onaudit -p /path/to/audit/trail
onaudit -s 1048576 # 1 Mb files
onaudit -e 0
onaudit -l 1

Now there is just that application security model for you to tackle.

Good luck and may you sail through your audits!


Large parallel index builds and temp space

This is a quick post about parallel index builds. Today I was building, with PDQPRIORITY set, an unfragmented detached index on a large table fragmented by range with ten large fragments and I saw this message in the online log:

10:28:53 WARNING: Not enough temp space for parallel index build.
Space required = 110566014 pages; space available = 8385216 pages.
Partial index build started.

You can see I am quite a long way short of the temp space required here; I need just over thirteen times more.

In this instance I have eight temporary dbspaces available and all are listed in the DBSPACETEMP onconfig parameter and I have no local override. They are all 2 Gb and using a 16 kb page size so have 131072 pages each and, as I am in single user mode, I know they are all free. onstat -d confirms that 131019 pages of 131072 are free in each of them. In case it’s relevant I also have 1,027,203 2 kb pages free in rootdbs.

The first thing that confuses me is the 8,385,216 pages the online log message says are available, which is more than I actually have. 131019 * 8 = 1048152. I think this is a bug as it’s a factor of 8 out. It’s probably assuming a 2 kb page size somewhere and my 16 kb dbspaces are a 8x multiple of this. I am using Linux so is Informix using native page size units and just not making it clear?

The index I am creating is on a bigint field and there are 7,076,224,823 rows. If I assume 110,566,014 pages actually means 210 Gb, the engine is calculating 32 bytes/row exactly, which sounds plausible.

Anyway despite the message in the online log I am comforted by this IBM support article which tells me:

You do not have to take action. This is a warning. The database server will create the index one fragment at a time, instead of all at once.

However, it does advise me that cancelling the statement, adding more temp space and starting again would be a good idea. This is prescient as we’ll see.

Analysing this now it is probably going to fail somewhere because I need thirteen times more space but the engine can only divide the workload by working on a single fragment at a time. There are ten and they are not all exactly the same size. In fact my largest fragment has 1,950,612,068 rows, 27% of the total, and based on 32 bytes/row the largest fragment I can handle would have only 536,653,818 rows. I suspect this means to succeed I will need at least 30,478,314 2 kb pages available to complete the build. I hope this all makes sense anyway!
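
Showing my working, all of which ties up with the figures in the log message:

110,566,014 pages x 2 kB                      = ~210.9 GB required for the full parallel build
210.9 GB / 7,076,224,823 rows                 = 32 bytes/row
1,950,612,068 rows x 32 bytes / 2 kB per page = 30,478,314 pages for the largest fragment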

Foolhardily and possibly because I get distracted by something I leave it to run. More messages appear in the log as the build progresses:

11:22:33 WARNING: Not enough temp space for parallel index build.
Space required = 110566014 pages; space available = 8385216 pages.
Partial index build started.
12:19:28 WARNING: Not enough temp space for parallel index build.
Space required = 110566014 pages; space available = 8385216 pages.
Partial index build started.
13:27:03 WARNING: Not enough temp space for parallel index build.
Space required = 110566014 pages; space available = 8385216 pages.
Partial index build started.
13:47:56 Session Insufficient space in temporary dbspaces:
Creating the temporary table in the root dbspace,
Temporary table size is 17632 pages.

Nearly four hours after it began at 14:27:41 my index build fails with:

212: Cannot add index.
179: ISAM error: no free disk space for sort

Harumph.

I guess there are two annoying things about this:

  1. The support article is only right if your largest fragment will not require more space than is available.
  2. The failure could have been foreseen at the beginning by looking at row counts.

Anyway, I hope this helps. If I get time I will do some more testing on this to confirm some of the assumptions I have made in writing this article. Feedback is welcome as ever (via https://informixdba.wordpress.com for those of you reading this on PlanetIDS)!


Ideas for monitoring an Informix instance

Is it possible to have too much monitoring of an Informix instance? I would say no, especially if your monitoring is lightweight and efficient. Being able to pinpoint an issue quickly in a disaster means you can reduce the time it takes to fix it, save your company’s money or reputation and be a hero at the same time.

I thought it would be worthwhile to do a general blog post on monitoring, partly to see if anyone will post suggestions for things I’ve missed via the comments section. I’m not going to focus too much on the exact mechanisms, although I will give a few hints here and there. Suffice to say you can do a lot with the various onstats, some simple queries on the system tables and some bash, perl, awk and greping to bind it all together. There are also powerful commercial monitoring tools, most notably AGS Server Studio, although personally I find there isn’t a single tool that fulfils all my monitoring needs.

Let’s start with the simplest monitor of all: is your instance up? One effective way of monitoring this is via the return code from onstat -. If it’s 5, all is well. But perhaps there’s a problem with the database listener which this won’t detect, so maybe you should also check that you can connect to your database via TCP/IP and run a simple (and quick) query.
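
A bare-bones check along those lines might look like this (a sketch only; it assumes your environment is set so that onstat and dbaccess point at the instance and TCP/IP alias you want to test):

#!/bin/bash
# Is the instance online? onstat - returns 5 when it is.
onstat - > /dev/null 2>&1
if [ $? -ne 5 ]; then
    echo "CRITICAL: instance not online"
    exit 2
fi
# Prove the listener answers a real (and quick) query.
if ! echo "SELECT FIRST 1 dbinfo('version','full') FROM systables;" | dbaccess sysmaster - > /dev/null 2>&1; then
    echo "CRITICAL: online but a query via the listener failed"
    exit 2
fi
echo "OK"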

Once you’ve established that your instance is online, you can check for other problems such as:

  • An excessive number of sessions waiting on locks, a logical log buffer or buffers via onstat -u.
  • The number of locks in use. I would avoid querying syslocks or running onstat -k because this can become inefficient when there’s a large number of locks. Using onstat -u instead is usually good enough and doesn’t run into scalability difficulties.
  • Locks with waiters.
  • Logical logs not backed up (see the example query after this list).
  • Backup failures or storage snapshot issues. It’s good to monitor this independently rather than rely on the success or failure of your backup script.
  • Replication status for HDR and RSS servers.
  • SQL error conditions encountered by your applications.
  • Long transactions and how many logical logs are needed to support a particular transaction.
  • The number of memory segments and free memory reported by onstat -g seg.
  • Memory used by CPU VP caches. I have a separate blog post on this.
  • Error messages in the online log. I have a separate blog post on this topic too.
  • Error messages in the console log.
  • Number of ready threads, indicating a lack of available CPU VPs.
  • Dbspaces running out of unallocated space.
  • Any chunks that are down.
  • Excessively long checkpoints.
  • Out of date statistics or distributions.
  • Query plan changes and slow-running queries (this one is tricky).
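
Picking one item from the list above as a concrete example, logical logs not yet backed up can be counted with a quick SMI query (a sketch; column names per sysmaster:syslogs):

SELECT COUNT(*) AS logs_not_backed_up
FROM sysmaster:syslogs
WHERE is_used = 1
AND is_backed_up = 0;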

It’s worth being aware of the various system limits you can run up against on larger systems:

  • Tables and indices reaching the maximum number of extents, particularly in dbspaces with a 2 kb page size.
  • Tables approaching the maximum number of data pages of 16,777,216. In not so recent versions it’s also necessary to check indices for the same limit but now the limit for detached indices is much larger (certainly it is in 11.70.FC7W1).
  • Auto-numbering fields running out of bits, particularly serials. (This may not be a problem for you if you don’t mind the next serial number wrapping around.)

It’s essential to work with your sys admins and network technicians who should have a good idea of what the operating system is doing. Often a high load average can be the first sign of a problem. They may be keeping a record of system performance that allows you to look back in time to when things were working better.

Lastly, a lot can be discovered about your system by periodically reviewing performance data for the number of locks in use, sequential scans and so on. This is a different type of monitoring to what I’ve mostly covered here, which is about detecting issues when they occur, but sometimes problems can only be spotted by analysing general system performance over time.

I think this is a pretty good list. Some of them are complete topics in themselves and worthy of separate blog posts.

There may be other things that you can do specific to your set up. I have a few monitors for specific bugs and the status of their workarounds, some of which are no longer needed. You may also choose to monitor the health of your application. For example, if there are message queues of some sort, are they being flushed quickly enough? Is your application regularly inserting data into its journal or log file as you expect?

My goal is to be able to detect any scenario it’s possible to predict, especially where we’ve had problems in the past. This implies that my monitoring needs to evolve and be improved over time.

Good and comprehensive monitoring should make your life as a DBA easier. It allows you to point colleagues to your monitoring tools and gives them confidence that the database is working properly when they have doubts. It saves you having to dig around, running manual health checks or similar.

So then, anything I’ve missed?


Working with auto update stats

This article is superseded by my more comprehensive post, Experience with Auto Update Statistics (AUS).

This article has been written based on experience with version 11.70.FC5W1 and assumes some knowledge of stats and distributions and how they affect query optimisation. I appreciate any feedback in the comments section on my WordPress blog.

Auto update stats was introduced in 11.50.xC1 and, while being partly aimed at the embedded market, meant for the first time there was a complete solution to gathering database statistics bundled inside the product.

Several other tools exist to help with gathering statistics, for example AGS’s Server Studio can produce a set of UPDATE STATISTICS commands in a script to run against your database. Anecdotally, most DBAs use Art Kagel’s dostats utility, packaged up in the utils2_ak package available from the IIUG software repository. Dostats is pretty damn good although it’s not a complete solution as some scripts are needed to control it. It comes with a partner utility, drive_dostats, to do this but many DBAs, including myself, have written their own. Because dostats is the de-facto standard, I’ll refer to it a fair bit in this article. Also version 11.70 has a number of enhancements that don’t require you to use auto update stats; I’ll cover these as well.

So if you’re happily using dostats or another method to manage statistics, should you consider changing to auto update stats? Should it be your method of choice for a new-build instance? Well maybe: this article will go through some of the advantages and things to be aware of.

Here are some of its advantages:

  • The whole solution is part of the database engine and supported by IBM support.
  • It provides a complete framework, working within defined maintenance windows and is highly configurable.
  • It can be managed through OAT, although this is not needed.
  • Auto update stats does less work and takes less time than many solutions because it does not (by default) gather distributions on non-indexed columns or do separate low stats for each entire index key.
  • It’s fully integrated with the enhancements to update stats introduced in version 11.70.
  • It works on all your databases, including the system ones.
  • Perhaps my favourite feature: if you make manual adjustments, like increasing the resolution of the distributions on a column, auto update stats notices this and maintains the distributions at the new resolution. Similarly, if you manually create a distribution on a column it will maintain this.

One reason some DBAs don’t use auto update stats is because it involves using the job scheduler, which I’m told had issues in early releases with high CPU usage. For this reason, many DBAs touch a file called $INFORMIXDIR/etc/sysadmin/stop to stop it starting when the engine comes online. With 11.70.FC5W1 we run the job scheduler without any issues. (As an aside, if you’re not using it, it’s worth looking at the jobs to see what you’re missing: post_alarm_message is particularly useful.)

So can you just enable the job scheduler and let auto update stats do its thing? Not really. The first thing to look at is these onconfig parameters, which are in 11.70 and take effect regardless of the statistics method used:

  • AUTO_STAT_MODE
  • STATCHANGE
  • USTLOW_SAMPLE

Using auto stat mode and a non-zero value for STATCHANGE is something you need to consider very carefully. Internally the engine keeps a count of the number of inserts, updates and deletes that occur on each table, something that Keshava Murthy covered in his blog. If these collectively do not exceed STATCHANGE percent of the row count, statistics or distributions are not updated. This applies even when you run an UPDATE STATISTICS command manually. Confusingly the command still returns ‘Statistics updated’ even when nothing is done; the only clue is that the prompt returns instantly. To get around this there is a new FORCE keyword for the UPDATE STATISTICS statement that reverts to the old behaviour.
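
As a quick sketch of what this looks like in practice (customer and customer_num are just example names, and my understanding is that the FORCE keyword goes at the end of the statement):

-- With AUTO_STAT_MODE on and STATCHANGE > 0 this may silently do nothing
-- if the table hasn't changed enough since the last run:
update statistics high for table customer (customer_num);

-- Adding FORCE reverts to the old behaviour and always rebuilds the distribution:
update statistics high for table customer (customer_num) force;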

I find turning AUTO_STAT_MODE on and setting STATCHANGE to zero works quite well: this just skips tables where no updates, inserts or deletes have occurred.

You can set the value of STATCHANGE manually on individual tables with a fast-alter operation:

alter table <table_name> statchange <change_threshold>;

As it’s an update to systables, be aware an exclusive table lock is briefly needed.

I’m not keen on setting STATCHANGE to a non-zero value because we have a lot of tables with incrementing date/time fields, meaning query optimisation is often time-based. I would find the option to override the age-based AUS_AGE parameter on a per-table basis much more useful, something that can only be done by writing your own script. Fortunately, as auto update stats evaluates the tables with stale stats on a regular scheduled basis, any ad-hoc updates are taken into account in its scheduling.

Setting USTLOW_SAMPLE enables sampling for UPDATE STATISTICS LOW statements, which is generally a good thing and can dramatically reduce the time these statements take. It can be overridden in your user environment. Sampling generally works well as long as the table is not heavily skewed in some way: if Informix thinks it is, you’ll see messages like this in the online log:

Warning: update statistics low using sampling may generate inaccurate index statistics for index owner.index_name due to data skew

Whether this is an issue for you will depend very much on your queries.

The other major enhancement in 11.70 is fragment-level statistics, but I shan’t cover it in detail here. If your storage schema is compatible with it and your table access patterns mean that some table fragments are never updated, it’s extremely useful. Informix’s implementation is nice in that the table stats are still considered as a whole when the optimiser evaluates queries, so you don’t get into trouble with having no stats for new fragments.
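
For reference, fragment-level statistics are switched on per table via the STATLEVEL property; a hedged sketch, with my_fragmented_table as a placeholder:

-- Gather and store statistics per fragment rather than for the table as a whole.
alter table my_fragmented_table statlevel fragment;

-- STATLEVEL AUTO lets the engine decide; STATLEVEL TABLE is the traditional behaviour.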

Perhaps the most significant difference between dostats and auto update stats is that dostats gathers additional distributions on non-indexed columns using UPDATE STATISTICS MEDIUM. If you have such distributions already auto update stats will continue to maintain them but it won’t create any new ones. All automated tools are attempting to apply a set of general criteria and recommendations to all tables so there is no hard and fast rule about whether you need them. One case to consider is where you have some sort of status flag such as a boolean or a column with a limited set of allowed values, perhaps enforced by a check constraint. Here distributions could be useful where these columns are used as filter conditions in queries. Otherwise, I suspect that in a lot of cases they are not needed. You’ll need to decide what is appropriate for your system.
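
If you do decide a particular non-indexed column deserves a distribution, you can create it manually and auto update stats will then maintain it, as described above. A sketch, with orders and status as made-up names:

-- Create a medium-mode distribution on a non-indexed column only;
-- DISTRIBUTIONS ONLY avoids redoing the low statistics at the same time.
update statistics medium for table orders (status) distributions only;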

Dostats also gathers low statistics separately for different indices which takes extra time but in my tests using version 11.50.FC9W2 this didn’t make any difference to the end result.

So what about switching on auto update stats? For this you’ll need to turn on the task scheduler if it’s not running already, which can be done with:

database sysadmin;
execute function task("scheduler start");

I would strongly recommend reviewing the enabled (and/or disabled) tasks and switching off any you don’t want or are not sure about. Review the jobs with:

database sysadmin;
select tk_name, tk_description from ph_task where tk_enable='t';

The relevant jobs for auto update stats are mon_table_profile, Auto Update Statistics Evaluation and Auto Update Statistics Refresh. Most of the others are fairly benign but I disable auto_tune_cpu_vps, add_storage, Low Memory Reconfig and mon_low_storage:

database sysadmin;
update ph_task set tk_enable = 'f' where tk_name in ('auto_tune_cpu_vps', 'add_storage', 'Low Memory Reconfig', 'mon_low_storage');

You’ll also find in table ph_threshold several parameters related to auto update stats:

  • AUS_AGE
  • AUS_PDQ
  • AUS_CHANGE
  • AUS_AUTO_RULES
  • AUS_SMALL_TABLES

Most of these are well-documented in the manual but the explanation of parameter AUS_AUTO_RULES is unclear as it just talks about enforcing a base set of rules. My understanding of the parameter is that:

  • When set to zero, auto update stats just maintains whatever statistics and distributions you have already. This retains any custom resolutions and confidence values you may have.
  • When set to one, it does the above plus it also makes sure that low stats are gathered on all tables, distributions in high mode for all leading index columns and distributions in medium mode for columns that are part of an index but not a leading key.

You can just update the parameters with manual SQL updates on the ph_threshold table. Likewise you’ll need to review and possibly update the scheduled run times for the two auto update statistics tasks in the ph_task table.
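
For example, something along these lines; I’m assuming the name and value columns in ph_threshold and the tk_start_time column in ph_task, so check the schemas on your version first:

database sysadmin;

-- Give the auto update stats refresh threads a higher PDQ priority.
update ph_threshold set value = '20' where name = 'AUS_PDQ';

-- Move the refresh task's start time to 8 pm.
update ph_task
set tk_start_time = datetime(20:00:00) hour to second
where tk_name = 'Auto Update Statistics Refresh';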

By default you just get one process updating stats but it’s possible to have two or more running at the same time by inserting a new row into table ph_task. I’d make sure that the total effective PDQ priority of all these tasks does not exceed 100.

At this point we’re sort of ready to go but you’re now trusting your stats gathering to a new process and I would suggest setting up some kind of monitoring to make sure it’s working as you expect. I feel this is a slight weakness in the implementation because you are back to writing your own scripts. Maybe one answer is to use OAT but this is not a good solution in our environment.

I would suggest monitoring the following:

  • That all three tasks related to auto update stats are enabled and scheduled to run at least once a week.
  • That the db scheduler is running, perhaps by checking for its threads with onstat -g ath.
  • That the values for the various AUS* parameters are sane.
  • That UPDATE STATISTICS LOW was not run too long ago for all tables. If you set AUTO_STAT_MODE it gets a little more complicated because you’ll need to use the information in Keshava Murthy’s blog post, referenced above, to work out whether the table needs to be updated.
  • Something similar for your distributions.
  • For any issues encountered whilst running the UPDATE STATISTICS statements. For this, query the aus_command table and check for rows where aus_cmd_state is E for error; the SQL and ISAM error codes will then be in columns aus_cmd_err_sql and aus_cmd_err_isam respectively (see the sketch after this list).
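
A sketch of that last check; I’ve assumed the command text lives in a column called aus_cmd_exe, so verify the table schema on your version:

database sysadmin;

-- UPDATE STATISTICS commands that failed during recent refresh runs.
select aus_cmd_exe, aus_cmd_err_sql, aus_cmd_err_isam
from aus_command
where aus_cmd_state = 'E';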

One problem you might find is that your scheduled maintenance times are not long enough to keep pace with how frequently you require your stats to be updated. You can look at adding an extra process or extending the times in this case. Even better, consider reading John Miller’s excellent article on tuning update statistics. It’s now over ten years old but still completely relevant today.

There is a view in the sysadmin database called aus_cmd_comp which shows you all the commands run recently. It gets purged daily so if you want to keep a permanent or longer record you might want to consider writing a procedure to copy its contents elsewhere and creating a scheduled task to call it.
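
A minimal sketch of that idea, assuming you have already created your own history table (aus_cmd_history is a made-up name) with the same columns as the view, and that you schedule this to run daily before the purge:

database sysadmin;

-- Simplistic archive: copies everything currently in the view; it does not
-- attempt to filter out rows that have already been archived.
insert into aus_cmd_history select * from aus_cmd_comp;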

It’s worth noting that auto update stats doesn’t do anything with stored procedure plans, i.e. UPDATE STATISTICS FOR PROCEDURE. If there are open statement handles using procedures, doing so can risk a -710 error unless (and sometimes even if) AUTO_REPREPARE is set in your onconfig. Whatever the situation on your system, you’ll need to do this manually or by another means.
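
The manual step itself is just the following, run against each database, ideally at a quiet time given the -710 risk above:

-- Re-optimise all stored procedure and function plans in the current database.
update statistics for procedure;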

Finally you might be wondering how the Auto Update Statistics Evaluation task prioritises tables for updating. The answer to this is in the procedures in $INFORMIXDIR/etc/sysadmin/sch_aus.sql.

In summary I like auto update stats and recommend it as long as you have a good understanding of how it works and are aware of the points I’ve raised in this article. It integrates nicely with the new features in Informix 11.70. I like the fact that it is easy to set up, although I do believe you need to monitor it if up-to-date stats are critical to your system. By not gathering medium-mode distributions on non-indexed columns and not running separate low statistics for each index, it does significantly less work than dostats. I appreciate the nice touches it has, like retaining and maintaining the existing resolution of your statistics.

As I said at the start of the article, feedback is welcomed and encouraged.