Large table operations

Introduction

Many Informix systems are at the heart of 24/7 production services, and getting downtime for maintenance can be costly. Your business may have adopted Site Reliability Engineering and its concept of error budgets, or may simply need to minimise the time systems are unavailable to customers.

The same systems can also store ever-increasing volumes of data, which means thought has to be given to how large operations are approached. Every junior DBA has probably felt the urge to hit Ctrl-C during an operation that is taking too long, but the duration of any roll back needs to be weighed against letting the operation continue. Being able to replicate your production system accurately allows you to assess the merits of each option and how much time you need.

Here are some examples of large table operations a DBA might need to undertake:

  • build a new index, partition an existing index or convert to a forest-of-trees index
  • table partitioning or change of page size
  • large scale data deletion

This blog post will cover some of the methods available and their gotchas.

Method | Transaction handling | New index | Table partitioning | Change of table page size | Managing data
ALTER FRAGMENT… ADD PARTITION… | Single transaction | No | Add partitions to an already partitioned table | No | No
CREATE INDEX… | Single transaction | Simple create index statement | No | No | No
ALTER FRAGMENT… INIT… | Single transaction | No | Rebuild table in one operation | Rebuild table in one operation | No
ALTER FRAGMENT… ATTACH or DETACH | Single transaction | No | Partition a non-partitioned table | No | Attach or detach whole table partitions
Loopback replication | Row | Copy of the whole table with new index | Copy of the whole table with the new storage schema | Copy of the whole table with the new storage schema | Copy of the table with a subset of the data
High Performance Loader (HPL) | Configurable commit interval | Copy of the whole table with new index | Copy of the whole table with the new storage schema | Copy of the whole table with the new storage schema | Copy of the table with a subset of the data
Desired outcomes and some methods to achieve them

Methods covered

Before you go ahead in production

Logical logging

All the SQL data definition methods complete in a single transaction and this means you need to consider logical logging requirements.

Do you have enough logical log space to support the operation? The onconfig parameter LTXHWM determines the percentage of logical logs used before any long transactions are automatically rolled back. Rolling back can use more logs and, on reaching LTXEHWM, the long transaction being rolled back gets exclusive use of the system, so it is best to make sure an unexpected roll back does not occur. Rolling back got faster in 12.10 with the use of read-ahead, but it is important to note that rolling back is generally a single-threaded operation.

If you are backing up logs properly, as you should be, you will also want to make sure you do not switch logs appreciably faster than you can back them up. Some backup solutions work more efficiently with larger logs.
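If you prefer to watch log consumption from SQL rather than 'onstat -l', something along these lines should work. This is a hedged sketch: the sysmaster:syslogs column names are as I recall them, so verify them on your version before relying on it.

-- Logical log usage; size and used are in pages.
SELECT number, size, used, is_current, is_backed_up
FROM sysmaster:syslogs
ORDER BY number;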

Testing

The ideal test environment is a clone of your production system. A clone or redirected restore can then be used to:

  • repeatedly test and refine your implementation.
  • get reasonably accurate timings.
  • understand logical log requirements.
  • find out whether you can do all or some of your work with the system online (this can be tricky).

If you aren’t privileged enough to have such an environment, you need to ensure your test system is as close to production as it can be. For this type of test you need:

  • A copy of the table being worked on with the same storage layout (dbspaces, page sizes, fragmentation strategy).
  • A copy of any tables referencing or referenced via a foreign key.
  • Some data.

Often only a small amount of data is needed, but it is fairly straightforward to populate test tables with tens of millions of rows, which may be enough to make any potentially time-consuming operations noticeable.
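One low-effort way to generate that kind of volume, sketched below with purely illustrative table and column names, is to seed a table and then self-insert so the row count doubles on every pass:

-- Illustrative test table; adjust columns to match the table under test.
CREATE TABLE test_table (
    id     SERIAL NOT NULL,
    mydata CHAR(40) NOT NULL
);

INSERT INTO test_table (mydata) VALUES ('seed row');

-- Each pass doubles the row count; around 24 passes gives tens of millions of rows.
INSERT INTO test_table (mydata) SELECT mydata FROM test_table;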

It is essential to have scripts which create the test conditions from scratch and can be re-run, or to get the system to a state you can back up and repeatedly restore.

Easy win: ALTER FRAGMENT… ADD PARTITION…

Informix 12.10.xC3 onwards: if your table uses round-robin partitioning and partitions are filling up you can simply add more partitions to it and enable AUTOLOCATE in the onconfig file:

e.g.

ALTER FRAGMENT ON TABLE my_table ADD PARTITION my_partition_09 IN my_dbspace1;
ALTER FRAGMENT ON TABLE my_table ADD PARTITION my_partition_10 IN my_dbspace2;

This is quick, easy and avoids the complication of the other methods described here. I'll cover options for tables that are not already partitioned shortly.
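If you would rather not edit the onconfig file by hand, AUTOLOCATE can also be changed dynamically. A hedged example using the SQL administration API follows; I'm assuming the "onmode, wf" admin command is available on your version, so treat it as a sketch:

-- Run while connected to the sysadmin database; the value 2 is illustrative.
EXECUTE FUNCTION task("onmode", "wf", "AUTOLOCATE=2");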

CREATE INDEX

Moving on to something a bit different, but it will make sense later.

Creating an index is easy but it’s important to bear in mind a few well-known requirements or optimisations:

  • having adequate sort space (DBSPACETEMP or file system with PSORT_DBTEMP) to avoid partial builds.
  • improving parallelism by setting PDQPRIORITY (if using Enterprise Edition).
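As a hedged illustration (assuming Enterprise Edition, and that DBSPACETEMP already points at adequate temporary space), an index build with these optimisations might look like this; the table and index names are placeholders:

-- Request maximum parallelism for the build, then reset it afterwards.
SET PDQPRIORITY 100;
CREATE INDEX ix_mytable_1 ON my_table (my_column);
SET PDQPRIORITY 0;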

If index page logging is enabled on your system, it can substantially increase the index build time, as the index build has three main stages before locks are released:

  • build the index, working with temp space if too large to do in memory.
  • write the new index to its dbspace(s).
  • scan the new index (ipl thread), generating logical log records to create a copy of it on your secondaries.

You can attempt to build an index online but this may not work on busy active tables.

The important thing to bear in mind is that the operations we go on to cover may perform index rebuilds, and these same optimisations can be applied.

ALTER FRAGMENT… INIT…

This is also a very easy and tempting option for rebuilding a table: a single command which manages it all for you and leaves all references, constraints and other aspects of the logical schema intact. This avoids a lot of the preparation and testing involved with other methods.

We can use this to rebuild or repartition both tables and indices. So why don’t we use this method all the time?

  • statement runs as a single transaction.
  • table is locked during the process.
  • when altering a table all the indices on the table need to be updated with new pointers.

For large tables, especially those with a lot of indices, there will be a lot of updates written to the logical logs. There is a danger the transaction will roll back, either due to reaching LTXHWM or a DBA aborting.

We also have to do one operation at a time, so repartitioning both a table and its indices would mean rebuilding the indices as part of the table alteration and then a second time when altering the indices themselves.

In my opinion this operation is only a good choice for small tables or where time is not critical.
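For reference, here is a hedged example of the kind of statement meant here, with placeholder names; either form runs as one large transaction:

-- Rebuild a table round-robin across two dbspaces...
ALTER FRAGMENT ON TABLE my_table INIT FRAGMENT BY ROUND ROBIN IN my_dbspace1, my_dbspace2;
-- ...or move it, unfragmented, into a dbspace with a different page size.
ALTER FRAGMENT ON TABLE my_table INIT IN my_16k_dbspace;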

ALTER FRAGMENT… ATTACH or DETACH

Attach

This operation allows you to merge tables or split partitioned tables. As with the “init” operation, this keeps the logical schema intact but can be a slow process if table row movement occurs or indices need to be rebuilt.

With particular partitioning strategies it is possible to attach or detach partitions with no table row movement and, in a subset of these cases, no index changes either, which makes the whole operation very fast. Unless you planned to use these features and designed your schema accordingly, it's unlikely you will stumble across them.

If you are thinking of using this method to partition an unpartitioned table it is possible to avoid table or index movement if moving to an expression-based table or, if using 12.10.xC15, a range partitioned table. The starting conditions are quite strict though: on the existing table any indices must reside in the same dbspace as the table. A second table with an identical logical schema must be created where indices “follow the table”, that is do not have a storage clause of their own.

Here is an example of a table which could be partitioned using this method:

drop table if exists "benthompson".toriginal;
create schema authorization "benthompson"
create table "benthompson".toriginal 
  (
    my_id int not null ,
    secondary_id integer default null,
    created datetime year to second default current year to second not null ,
    vchar1 varchar(15) not null ,
    tertiary_id integer not null ,
    expiry_date datetime year to second default null
  ) in my_dbspace1 lock mode row;

create unique index "benthompson".ioriginal_x1 on "benthompson".toriginal (my_id, vchar1, tertiary_id) using btree in my_dbspace1;
create index "benthompson".ioriginal_x2 on "benthompson".toriginal (secondary_id) using btree in my_dbspace1;
alter table "benthompson".toriginal add constraint primary key (my_id, vchar1, tertiary_id) constraint "benthompson".coriginal_pk;

All the indices need to be capable of following the table, meaning we could not have another unique index unless its leading column was my_id.

We can create an empty table with the same schema as follows:

drop table if exists "benthompson".texpression;
create table "benthompson".texpression as select * from toriginal where 1=0;

Then partition it by expression:

alter fragment on table texpression init fragment by expression (my_id < 0) in my_dbspace1, (my_id >= 910000000) in my_dbspace2;

Or with range partitioning:

alter fragment on table texpression init fragment by range (my_id) interval (10000000) partition tfm_range_p0 values < 0 in my_dbspace2;

The new table will not have any indices so index it:

create unique index "benthompson".iexpression_x1 on "benthompson".texpression (my_id, vchar1, tertiary_id) using btree;
create index "benthompson".iexpression_x2 on "benthompson".texpression (secondary_id) using btree;

As the new table is empty all these operations will be quick.

Drop constraints on the original table (required for the attach):

alter table toriginal drop constraint coriginal_pk;

Attach the original table to the new:

alter fragment on table texpression attach toriginal as (my_id>=0 and my_id <10000000) after my_dbspace1;

or for range partitioning:

alter fragment on table texpression attach toriginal as partition texpression_p1 values < 10000000;

Finally add a primary key to the new table:

alter table "benthompson".texpression add constraint primary key (my_id, vchar1, tertiary_id) constraint "benthompson".cexpression_pk;

If doing attach operations in quick succession on the same table you may wish to disable AUTO_STAT_MODE otherwise you may need to wait for an UPDATE STATISTICS operation to complete in between.

Detach

Detaching a partition is straightforward syntactically but again may be slow unless the index is partitioned the same way as the table.
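A hedged example of the syntax, reusing the partition name from the attach example above and a placeholder name for the new standalone table:

-- Detach one named fragment into its own table.
ALTER FRAGMENT ON TABLE texpression DETACH PARTITION texpression_p1 tdetached;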

For a quick detach without any index rebuilds, the database engine requires the index to follow the table (sysfragments ‘strategy’ column set to ‘T’) but there is a recent fix where this will work for expression based indices if the expression is identical to one belonging to the table.

IT34340: ALTER FRAGMENT DETACH ON TABLE FRAGMENTED BY INTERVAL REBUILDS INDEX EVEN WHEN INDEX HAS SAME INTERVAL EXPRESSION

Once a fragment has been detached the engine will rebuild table statistics if AUTO_STAT_MODE is enabled, and this will prevent a second fragment from being detached until the statistics update completes.

Detecting movement

If testing with a small table it can be difficult to tell how efficient the operation was. To verify whether your operation results in row movement or implicit index rebuilds, you can use ‘oncheck -pe’ to compare the extents for the table and any indices before and after your operation. All extents should remain the same if no movement occurred even though the table names may change.

Loopback replication

Overview and setup

Loopback replication was introduced in 12.10.xC11 and is perhaps most easily thought of as Enterprise Replication (ER) between two tables residing in the same instance. We can use it to rebuild a table by creating a new copy of it with whatever storage schema we desire. The new table could differ from the original in a number of ways: it could be indexed differently or have fewer or more columns, for example.

Using loopback replication requires some preparation before anything can be replicated. I won’t cover it all in detail here because it is in the Informix manual. In short you need to:

  • Create dbspaces and an sbspace for CDR_DBSPACE, CDR_QHDR_DBSPACE and CDR_QDATA_SBSPACE.
  • Tune the onconfig, especially CDR_EVALTHREADS, CDR_QUEUEMEM and CDR_MEM.
  • Make changes to your sqlhosts file and the onconfig variable DBSERVERALIASES.
  • Define servers using ‘cdr define server’.
cdr define server -A /var/informix/ats/g_my_instance_er_server -R /var/informix/ris/g_my_instance_er_server -I g_my_instance_er_server
cdr define server -A /var/informix/ats/g_my_instance_loopback -R /var/informix/ris/g_my_instance_loopback -I g_my_instance_loopback -S g_my_instance_er_server

I will refer to the original table as the “source” and the new table as the “target”. In my examples the source and target will share the same logical schema.

The advantage of loopback replication is that the copying of data which takes the most time can take place while your system is running, in much the same way as ER replicates tables on a running system. The steps which do require exclusive access are fast.

So how can we use it to, say, make a copy of a one billion row table with the end result being two large tables with the exact same data being replicated asynchronously? At the end of the process we want to switch off replication and use renaming to move the new table into position but how do we get to this point?

All approaches start with creating a new, empty table set up just as we want it. As we don't want to have to do large index builds at the end of the exercise, we should create all indices and constraints on the target at the beginning. We might choose to add triggers and foreign keys (both those referencing other tables and those referencing this table) at the end.

Replicates

If you're unfamiliar with ER, the tables being replicated are called "replicates": more strictly speaking, a replicate isn't necessarily the whole table but could be just a subset of its columns and a subset of its rows, filtered by a where clause. When you start replicating from one replicate to another, what goes across? Only new transactions are replicated. For an insert this is straightforward (provided it does not violate a constraint on the target), but updates and deletes may find there is no row on the target to update or delete. You can control what happens in these scenarios with the CDR_SUPPRESS_ATSRISWARN onconfig parameter: amongst other things it can enable "upgrading" an update to an insert. One important thing to note is that unless your table represents some kind of rolling log, it is not sufficient to just start loopback replication: you will need to do something to load the existing data.

Creating a replicate:

cdr define repl --connect=g_my_instance_er_server loopback_replicate --conflict=always --scope=row --ats --ris --floatieee --master=g_my_instance_er_server "P my_dbname@g_my_instance_er_server:benthompson.original_table" "select * from original_table" "R my_dbname@g_my_instance_loopback:benthompson.new_table" "select * from new_table"

Don’t forget to start the replicate.

Pre-loading existing data

Let's consider some ways we can load the data that already existed before we began replicating. This is the hardest part.

  1. Perform a dummy update on rows at the source to cause them to be replicated to the target.
  2. Use ER in-built tools to check and repair or sync the replicates.
  3. Use a combination of (2) and High Performance Loader (HPL) or other load method to do the heavy lifting.

(1) is a home-brew approach but gives you control. A dummy update is simply an update which doesn't actually change anything, i.e. setting a value to what it's already set to, but it can be used to flag old data for replication. You probably don't want to use this method, but it is there.
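As a hedged sketch (table and column names are purely illustrative), a dummy update batched by key range might look like this:

-- Setting a column to its current value changes nothing locally but queues
-- the rows for replication. Batch by key range to avoid a long transaction.
UPDATE original_table
SET secondary_id = secondary_id
WHERE my_id BETWEEN 1 AND 1000000;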

cdr check (with repair option)

‘cdr check’ goes through every row on the target and checks it exists on the source and all values within the replicate are the same. It will also find any extra rows in the target replicate not in the source. By using its repair option it will not just produce a report, it will also correct the data by pushing updates onto the send queue. It works best when the differences between the source and target are not large so preloading as much data as possible is a good idea. HPL is an efficient and straightforward way of doing this (worked example below).

‘cdr check’ isn’t quick and it is single threaded. On very busy tables it may not be able to keep up with the rate of change. A checksum operation is done on all columns in the replicate not in the primary key (or unique index) on both the source and target and compared. Rates of 2000 rows/second could be all you see. It is possible to vastly improve its performance by adding a pre-calculated “replcheck” column and indexing this column in a new unique index also containing the primary key. You will probably not be able to take advantage of this because adding this column on the source presents the same challenges you are trying to work around in the first place and additionally adding the replcheck column locks the table during the entire operation.

‘cdr check’ example:

nohup cdr check replicate --master=g_my_instance_er_server --repl=loopback_replicate g_my_instance_loopback --repair >loopback_replicate.out --name loopback_replicate_check_repair 2>&1 &

Check status with (this only works if your check replicate job has been given a name):

cdr stats check

cdr sync

An alternative is to use 'cdr sync' to bring the target in line with the source. Having an empty target table actually permits certain optimisations, so if the target only contains a small subset of the data it is best to truncate it and start from empty. The database will build a shadow replicate which will be deleted at the end of the sync. I don't recommend 'cdr sync' unless you have 14.10.xC6 or later: in earlier versions a few undesirable things can happen at particular points in the process, such as a bloated send queue and a large number of row locks on participating tables when deleting the shadow replicate. 'cdr sync' deserves a whole article of its own, but if you do wish to use it make sure the CDR_EVALTHREADS onconfig parameter is < 10 and ignore the manual's recommendation to have at least one per CPU VP.

Onconfig tuning for loopback replication

Parameter | Value | Comments
CDR_EVALTHREADS | One per CPU VP but never more than 10, i.e. 0,10 | Do not follow the manual's recommendation of one per CPU VP on large systems. Extra threads will just wait on a mutex.
CDR_QUEUEMEM | 65536 – 131072 if earlier than 14.10.xC6 | 64 – 128 MB is the recommendation from HCL support and larger values may be detrimental. In 14.10.xC6+ larger values may be beneficial.
CDR_SUPPRESS_ATSRISWARN | 3 | Depends on your use case; '3' will convert updates to inserts if rows are not on the target.
CDR_QDATA_SBSPACE | One or more smart blob spaces | Large loopback jobs can result in a large send queue in versions earlier than 14.10.xC6 and could easily need tens of GB. Smart blob spaces used by ER do not auto-expand.

Finishing off

A few tidy-up jobs:

  • Stop and delete the replicate.
  • Switch tables, which may include (a sketch of the switch-over follows the query below):
      • Drop foreign keys on the original table.
      • Rename constraints and indices.
      • Rename the original table and rename the new table in its place.
      • Drop and recreate any triggers.
  • Recreate any foreign keys referencing the original table, which you can find with:
select t.tabname, c.constrname from sysconstraints c, systables t where c.tabid=t.tabid and c.constrid in (select constrid from sysreferences where ptabid in (select tabid from systables where tabname='XXXXXXXX'));
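Here is a hedged sketch of the table switch itself, with placeholder names and assuming the replicate has already been stopped; renaming constraints and indices and recreating triggers and foreign keys are separate steps as listed above:

-- Move the old table out of the way and promote the new one.
RENAME TABLE original_table TO original_table_old;
RENAME TABLE new_table TO original_table;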

HPL example

Unload and load jobs are needed.

Create a named pipe (this does not work on Windows; if using Windows you would need to use an intermediate file):

mknod /home/informix/mytable.pipe p

Create the unload and load jobs:

onpladm create job mytable_unload -d 'cat > /home/informix/mytable.pipe' -fup -D dbname -t original_table -T my_informixserver
onpladm create job mytable_load -d 'cat /home/informix/mytable.pipe' -flpc -D dbname -t new_table -T my_informixserver

Creating these jobs will create an onpload database in your instance if it does not already exist.

Run jobs in two different windows:

onpladm run job mytable_unload -fu -l /home/informix/mytable_unload.out -S my_informixserver
onpload -j mytable_load -fl -I 200000 -i 200000 -l /home/informix/mytable_load.out

You may want to run these in the background using nohup, e.g.

nohup onpladm run job mytable_unload -fu -l /home/informix/mytable_unload.out -S my_informixserver >unload.out 2>&1 &
nohup onpload -j mytable_load -fl -I 200000 -i 200000 -l /home/informix/mytable_load.out >load.out 2>&1 &

The log files are quite verbose and should flag any problems.

I believe the above is much easier than writing a stored procedure. HPL handles the transaction size for you (200,000 rows per commit in the example above). Anyone who remembers the dated and unfriendly HPL GUI will notice I did not need to use it.

If HPL goes wrong you can start again by:

  • killing any onpload or onpladm processes.
  • dropping the onpload database (this removes all job definitions).
  • truncating the target table.

There is a bug (IT36978) where HPL will corrupt compound forest-of-tree indices which may be fixed in 12.10.xC15 or 14.10.xC7. Avoid defining such an index on target tables you wish to use with HPL.

IIUG TV

This material has also been presented on IIUG TV.


RSS_FLOW_CONTROL

Most of the time I like my blog to be positive or at least mostly positive with a few caveats. This post is entirely caveats.

RSS_FLOW_CONTROL: please permit me to vent about this.

Man, do I dislike this feature. Where to start?

Perhaps with what it does. This feature is meant to ensure that the last acknowledged log received by an RSS server does not fall too far behind the primary’s current log position in an Informix cluster.

It sounds innocuous enough but how is this achieved? Given your RSS server is always working as fast as it can and its network link with your primary is what it is, it can only be done by stopping the primary server from writing temporarily until the RSS catches up. Many applications won’t respond well to this.

Someone somewhere probably had a use case for this along the lines of how much data loss they could tolerate in the event of losing their primary site, but is this and how it is implemented what most users want: probably not. It’s on by default, by which I mean in the onconfig.std shipped with the product it is enabled. Not only is it on but the default values of activating when the difference is just 12x the log buffer size (default 64 kB) and deactivating when this drops to 11x can see it operating the moment your network hiccups.

The onconfig.std ships with the value '0' which, at a casual glance, suggests the feature should be off, but this actually represents 'on' with the default values above, which are quite aggressive. It is '-1' which turns it off; to be fair, there are comments in the file making this clear. Except in 12.10.xC14, where there is a bug which means it is on even when set to '-1', and with those aggressive defaults.

How do you know when it's operating? You may capture a lot of user threads waiting on the logical log buffer (state 'G'), but this could be the result of a myriad of other issues. The only way to know for sure is to catch it in the act with 'onstat -g rss verbose' and compare:

RSS flow control:3072/2816
...
Approximate Log Page Backlog:3232

As 3232 > 3072 it is working and it won’t stop until the backlog drops below 2816. There is no simple ‘disabled/off/on’ status.

Most of what I have written is in the manual; what isn't is how this feature can be on when you never consciously enabled it, and how it's impossible after the event to tell it ever operated unless your diagnostics captured that one onstat at the right time. Maybe reading this post will help someone with a future investigation.

Given where we are now the following would be useful usability enhancements in my opinion:

  • ship onconfig.std with RSS_FLOW_CONTROL -1
  • print a line in the message log when this feature operates (may need to be rate limited).

If you are waiting for my post on large table operations it is coming, venting is just easier.


AUTOLOCATE

Introduction

AUTOLOCATE is a very interesting and relatively new Informix feature and one that I believe has been so far undersold. Introduced in 12.10.xC3, the Informix manual has this to say about it:

Use the AUTOLOCATE configuration parameter to control whether the database server controls the location of new databases, indexes, and tables and the fragmentation of those tables. If you set the AUTOLOCATE configuration parameter to a positive integer, the database server performs the following tasks:

  • Stores new databases for which you do not specify a location in the optimal dbspace instead of in the root dbspace. By default, all dbspaces except dbspaces that are dedicated to tenant databases are available. However, you can control the list of available dbspaces.
  • Fragments new tables by round-robin, where the number of fragments is equal to the value of the AUTOLOCATE configuration parameter.
  • Adds more table fragments as the table grows.

I had ignored it, largely because on first reading it appears to be a feature aimed at newbies to Informix. Our systems are mostly well-established and decisions regarding dbspaces and fragmentation have already been made by DBAs sweating over a keyboard driving a spreadsheet (or script).

In fact it has a few highly significant and undocumented facets that address a number of fundamental storage issues. In a nutshell if you wanted a table to have the potential to grow as large as possible you needed to be aware of the following (taken from IBM support document, What is the maximum number of pages for a table or fragment? What is the maximum number of rows per page?):

  • 3 bytes are used to store the page number, so a maximum of 16,777,215 pages can be used in one table or fragment.
  • 1 byte is used to store the slot number, so a single page can have at most 255 slots/rows.
  • The maximum number of rows in a table or fragment is 4,278,189,825.

The last limitation is the combination of the first two at their limits and doesn’t need separate consideration, however all of this combined means:

  • any table larger than 32, 64, 96, 128, 160, 192, 224 or 256 GB, depending on the page size (2-16 kB), needs to be partitioned.
  • tables with narrow rows cannot benefit from the larger page sizes without wasting storage space.
  • when defining an interval for a table partitioned by range, the interval must not be so large that it allows fragments to reach the page limit.
  • tables partitioned by round-robin are limited to the above size limits multiplied by the number of fragments you created.

The fundamental limits from the support document are still there and probably will be for some time to come. However, AUTOLOCATE makes it much much easier to deal with the most significant of them, as we will see.

AUTOLOCATE in action

Firstly let's imagine a table using a 2 kB page size, fragmented by round-robin with two partitions. Its maximum size is 64 GB. The term round-robin refers to how data gets inserted into the table: it will be in the fashion 1-2-1-2-1-2 and so on. If the table had four partitions it would be 1-2-3-4-1-2-3-4. If your table starts from empty and you insert data over a period of time, all the partitions will contain the same number of rows. What happens if my table with two partitions approaches 64 GB, let's say 60 GB, and I cannot delete any data and must keep inserting more? Well, I can add another partition or two (even another ten). My table will now fill up 1-2-3-4-1-2-3-4 and so on. However, the two original partitions hold 60 GB between them and the two new ones are empty, so I am still only 4 GB away from the original partitions filling; it is just going to take twice as long to get there.

Let’s see how AUTOLOCATE can help by creating a table for testing and filling it with data.

Create table and populate:
drop table if exists rr_table;

create table rr_table (
myid bigserial not null,
mydata char(40) not null
) with rowids fragment by round robin in 2kb_dbspace1, 2kb_dbspace2;

create table source_data (
text_field varchar(40)
);

insert into source_data values ('r4dR3leer4IWTLNo5rZpYS9HTGrQngjcDp5FcQwn');
insert into source_data values ('FtIgPPVDPvnqlSf6QNSkkIDv7GvXIrN2L22w4x7K');
insert into source_data values ('N5VtAmIxSl89lt3O34hshvCktWjYOiMdGJjwwjVe');
insert into source_data values ('Kb13xQcI5IZqFjC0bu4u4zqK1JUSvZKjmGD1OZ3K');
insert into source_data values ('5XhqJQ1lm2LYKzatCngXrEJMEDGF8Qw5gxsetqTx');
insert into source_data values ('C1EXpTrmk28F2ETnvoK1QhF8v3gA4iBciaQw7NtW');
insert into source_data values ('7OvpiycZRjTFPEzk6FM2hJsORWVEAUhOM2O211tr');

drop procedure if exists insert_data;
create procedure insert_data ()

loop
begin work;
insert into rr_table (mydata) select * from source_data;
commit;
end loop;

end procedure;

execute procedure insert_data();

While this was running to fill up my table, I thought I would experiment with different numbers of sessions and values for AUTOLOCATE. Each result below is the average of five measurements taken over ten seconds. AUTOLOCATE can be changed dynamically and immediately affects running sessions. For the tests I created multiple source tables to rule out contention there.

Number of sessions inserting into two-way round-robin table, no indices | Rows inserted per second, AUTOLOCATE 0 | Rows inserted per second, AUTOLOCATE 1
1 | 85 300 | 90 300
2 | 94 400 | 93 400
3 | 110 200 | 99 700
4 | 117 200 | 100 500
5 | 121 400 | 96 000
6 | 120 100 | 91 300
7 | 115 800 | 88 800
8 | 109 000 | 87 000

(For values of AUTOLOCATE greater than 1 the results were very similar to 1 so I haven’t bothered to include any in the table.)

So we can see that AUTOLOCATE does have an effect on how data are inserted: it changes the algorithm for insertion into tables fragmented by round-robin. If the parameter is set to any non-zero value, the partition which is most empty will receive the rows, which means it isn't really round-robin any more. When populating a table from empty the behaviour appears to be more or less the same; however, in my example, once my two new partitions are added they will receive all the new rows.

What happens in the situation where one or more fragments reaches the page limit? Without AUTOLOCATE set, the database engine will continue to insert data into fragments with space; however, maximum insert performance will reduce by a factor of about two hundred. This is presumably because it checks bitmap pages in the full fragments each time their turn comes in case any data were deleted by other processes.

However, with AUTOLOCATE set, with the new algorithm (most empty fragment gets priority) this situation does not arise unless all fragments are full.

Furthermore, with AUTOLOCATE, when all fragments are full the database engine will automatically add more. One thing to be aware of: if, say, you had a round-robin table with eight fragments and they all fill, and AUTOLOCATE is set to just 1, the engine will add just one fragment, so you are no longer writing into multiple fragments. The Informix manual section Managing automatic location and fragmentation explains this, and how to control the dbspaces used, very well.

With AUTOLOCATE set to 0, I eventually hit the dreaded 271: Could not insert new row into the table.

The online log reports it too:
WARNING: partition 'mydbname:informix.rr_table': no more pages

Adding two more fragments with ALTER FRAGMENT ADD, I can now test the situation where two fragments are full and two are empty, and show the basis for the two-hundred-times speed reduction claim above. I suspect this will be highly dependent on the number of bitmap pages to scan and how these pages are marked. In this test, if AUTOLOCATE is 0 you'll continue to see the no more pages message periodically in the online log.
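For completeness, a hedged example of the statements used to add the fragments; the dbspace names mirror the earlier ones and are illustrative:

-- Add two more round-robin fragments to the test table.
ALTER FRAGMENT ON TABLE rr_table ADD 2kb_dbspace3;
ALTER FRAGMENT ON TABLE rr_table ADD 2kb_dbspace4;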

Number of sessions inserting into four-way round-robin table with two full fragments, no indices | Rows inserted per second, AUTOLOCATE 0 | Rows inserted per second, AUTOLOCATE 1
1 | 390 | 74 220
2 | 588 | 83 300
3 | 741 | 91 900
4 | 919 | 96 100

There is a bit of a drop in performance compared with the initial results for the two-way round-robin table, even when AUTOLOCATE is non-zero. An assumption you could make is that this is because there are more fragments whose sizes must be checked before inserting, but I have not tested or verified this.

So you can see this is extremely useful if your round-robin fragmented tables are getting large: just add more fragments and ensure AUTOLOCATE is on; or, if you're brave, let the engine add them for you. I say brave because you should be aware that our testing found defect IT34090, where this process fails for tables with rowids and varchar or lvarchar columns. I hope it will be fixed in 12.10.xC15 and 14.10.xC5.

Bored of calculating the optimum page sizes for your tables?

I also mentioned that narrow tables cannot benefit from large page sizes because 255 times the row size can often come to less than the page size. This means a careful DBA needs to calculate the optimum page size for any given table. There will usually be a range of suitable page sizes, and two approaches I have used are to take the largest that will not waste space or, if you intend to use table compression, to assume your rows are a half to a third of their actual width and apply the same logic (since compression increases the number of rows per page possible but does not change the 255 rows/page limit).

However, you can now let AUTOLOCATE do this job for you. The Informix manual states that it "stores new databases for which you do not specify a location in the optimal dbspace instead of in the root dbspace". "Optimal" is not defined, but I hope it takes into account all the considerations above.

It’s quite easy to start experimenting by providing the database engine with a list of dbspaces of different page sizes and see where your data gets placed. (You will need to insert at least one row for an extent to actually be created.)

EXECUTE FUNCTION task("autolocate database", "my_dbname", "2kb_dbspace1,4kb_dbspace1,6kb_dbspace1,8kb_dbspace1,10kb_dbspace1,12kb_dbspace1,14kb_dbspace1,16kb_dbspace1");

Selected results are in the tables below. My initial expectation was that there would be clear stepping points based on the row size where the algorithm would select the next page size up and, while this is definitely an overall trend, the inner workings of the algorithm were difficult to ascertain when I only examined ten or so tables. To try and see the patterns more clearly I created around 1500 tables with different schemas and examined the page size selected versus the row size. For the test all possible page sizes were made available but your system may restrict what can be used.

The first result I found is that the page size selected seems to entirely depend on the row size as reported in systables; it does not seem to matter what columns are used, specifically whether there are any variable length columns.

The formula for calculating rows per page is rows per page = trunc(pageuse/(rowsize + 4)) where pageuse is the page size in bytes minus 28 (source Informix manual, Estimating tables with fixed-length rows). We can use this to calculate the minimum row size that can be used for each page size without wasting space.
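As a hedged worked example of the formula (sysmaster:sysdual is used here simply as a convenient one-row table, assuming it exists on your version), take a 137-byte row on a 2 kB page:

-- pageuse = 2048 - 28 = 2020; rows per page = trunc(2020 / (137 + 4)) = 14,
-- leaving 2020 - 14 * 141 = 46 bytes unusable, matching the table further below.
SELECT TRUNC((2048 - 28) / (137 + 4)) AS rows_per_page
FROM sysmaster:sysdual;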

Page size (bytes) | Minimum row size to fill page based on formula | Smallest row size at which AUTOLOCATE will select this page size (empirical) | Largest row size at which AUTOLOCATE will select this page size (empirical)
2048 | 4 | 1 | 32074
4096 | 12 | 43 | 2021
6144 | 20 | 85 | 1515
8192 | 28 | 81 | 1615
10240 | 36 | 137 | 4591
12288 | 44 | 133 | 2403
14336 | 52 | 142 | 4574
16384 | 60 | 224 | 3187

For the largest row size, I didn’t exhaustively try every row size; these are just what is used in the database schema I loaded.

We can maybe begin to see what the engine is trying to do. There is clearly a smallest row size for which each page size will be chosen and this gives at least 2.7x head room, presumably to allow compression to be effective.

At the extreme, there is a seemingly rogue result for the table with a 32074-byte row size (comprising serial, int, varchar(64), char(32000) and char(1) columns) being put in a 2 kB dbspace. (I would have thought 16 kB would suit better, minimising the number of pages each row spans.) Some simple maths suggests that minimising wasted space may be the overriding consideration, as the following examples show.

  • 4096 page size: 2 x (2021 + 4) + 28 is 4078.
  • 6144 page size: 4 x (1515 + 4) + 28 is 6104.
  • 8192 page size: 5 x (1615 + 4) + 28 is 8123.
  • 10240 page size: 2 x (4591 + 4) + 28 is 9218 (not such a good fit).
  • 12288 page size: 5 x (2403 + 4) + 28 is 12063.
  • 14336 page size: 3 x (4574 + 4) + 28 is 13762.
  • 16384 page size: 5 x (3187 + 4) + 28 is 15983.

If we apply the same logic elsewhere, what do we get? Looking at a range of ten row sizes, minimising the space wasted does indeed seem to be a goal.

Row size | Page size selected (kB) | Bytes wasted 2 kB | Bytes wasted 4 kB | Bytes wasted 6 kB | Bytes wasted 8 kB | Bytes wasted 10 kB | Bytes wasted 12 kB | Bytes wasted 14 kB | Bytes wasted 16 kB
133 | 12 | 102 | 95 | 88 | 81 | 74 | 67 | 60 | 53
134 | 8 | 88 | 66 | 44 | 22 | 0 | 116 | 94 | 72
135 | 6 | 74 | 37 | 0 | 102 | 65 | 28 | 130 | 93
136 | 4 | 60 | 8 | 96 | 44 | 132 | 80 | 28 | 116
137 | 10 | 46 | 120 | 53 | 127 | 60 | 134 | 67 | 0
138 | 6 | 32 | 92 | 10 | 70 | 130 | 48 | 108 | 26
139 | 8 | 18 | 64 | 110 | 13 | 59 | 105 | 8 | 54
140 | 2 | 4 | 36 | 68 | 100 | 132 | 20 | 52 | 84
141 | 4 | 135 | 8 | 26 | 44 | 62 | 80 | 98 | 116
142 | 14 | 122 | 126 | 130 | 134 | 138 | 142 | 0 | 4

I did these calculations in Excel so they must be right, yeah? Assuming some of the possible page sizes are just too large to be considered, the only thing this theory can't explain is why, for a 137-byte row size, the engine chose a 10 kB page size over a 2 kB one.

What is clear is that using the largest page size possible, thus reducing the number of I/O requests, is not a goal of AUTOLOCATE. Perhaps for many systems this doesn't help anyway: in the OLTP world, if you want to read a single row and need to read in one 2 kB page or one 16 kB page to do so, which is actually more efficient in terms of memory and the buffer pools?


Monitoring Informix with the Elastic Stack

Introduction

If you’re not familiar with the Elastic Stack it is a suite of products for ingesting data or logs, searching, analysing and visualising. There is a good overview over at the Elastic web site of how it can be put together. I say “can” because the stack is very flexible and, for example, you can send JSON documents to Elasticsearch via a REST API, rather than use Filebeat or Logstash.

This blog post is mostly concerned with ingesting the Informix online log with Filebeat, recognising certain types of log line that can occur and tagging the file using rules set up in Logstash, before sending it to Elasticsearch for storage and indexing. Finally Kibana can be used to visualise the data stored in Elasticsearch.

It’s easy to see how this could be scaled up to provide a single place to visualise logs from multiple instances and it would be fairly trivial to add in other logs too, like the Informix bar logs and logs from the operating system.

At IIUG 2018 in Arlington, VA I presented a talk entitled DevOps for DBAs, which demonstrated the Docker set up described below, but I hadn't at the time documented the full set up. Here it is!

Practical demonstration with Docker containers

Overview

This demonstration sets up two containers: one running Informix Developer Edition and Filebeat to collect and ship logs:

  • Informix 12.10.FC12W1DE, listening on port 9088/tcp for onsoctcp connections.
  • Filebeat 6.5.2.

and the other running the Elasticstack components as follows:

  • Logstash 6.5.2, listening on port 5044/tcp.
  • Elasticsearch 6.5.2, listening on port 9200/tcp.
  • Kibana 6.5.2, listening on port 5601/tcp.

Access to Kibana is via your favourite web browser running on your desktop. Nginx will be listening on port 80 so you can simply access http://localhost/

For a secure production implementation it’s recommended that you use Nginx with HTTPS as a reverse proxy for the Kibana web service as shown in the diagram. We’ll be using Nginx in this demonstration, rather than connecting to Kibana directly, but we won’t be configuring SSL; there are plenty of online guides about how to do this. Also communication between Filebeat and Logstash should be encrypted: this blog post doesn’t cover this.

The above versions are current at the time of writing (December 2018). Elasticstack moves very quickly so it is likely these will not be the latest versions by the time you read this. The idea of this blog post is that you should just be able to copy and paste the commands and end up with a working system but you shouldn’t be surprised if things don’t work perfectly if your versions don’t match the above. For example, in between beginning this blog post and finishing it, version 6.5.x was released with improved default security settings, with services only listening on the loopback interface without reconfiguration.

Running the whole Elasticstack in Docker plus Informix does require a reasonable amount of memory and I’d suggest a minimum of 2.5 GB to be allocated to the Docker Engine.

Docker network

To provide name resolution between the containers we are going to start by creating a docker network:

docker network create --driver bridge my_informix_elk_stack

Elastic Stack installation

Docker container

We'll start by setting up the Elastic Stack Docker container, which will be based on a (minimal) Debian installation. In a terminal run:

docker pull debian
docker run -it --name elasticstack_monitoring -p 80:80 -p 5044:5044 --hostname elasticstack -e "GF_SECURITY_ADMIN_PASSWORD=secret" --net my_informix_elk_stack debian

Your terminal should now be inside the Docker container and logged in as root.

To avoid some issues with debconf when installing packages run:

echo 'debconf debconf/frontend select Noninteractive' | debconf-set-selections

Install Java

Run these commands to install the software-properties-common package and OpenJDK 8, which is required by Elasticsearch and Logstash. Java 9 should be fine too.

The Debian Docker image does not come with many packages pre-installed so I am also going to install vim for editing files later plus a few other essentials; you may prefer nano or another editor.

apt-get update
apt-get install software-properties-common gnupg vim wget apt-transport-https openjdk-8-jre

The Debian Docker container is a basic installation, so this short list of packages has hundreds of dependencies.

Check the Java version:

# java -version
openjdk version "1.8.0_181"
OpenJDK Runtime Environment (build 1.8.0_181-8u181-b13-2~deb9u1-b13)
OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)

Install Elasticsearch

The Elasticsearch installation is more straightforward than Oracle Java and follows standard Linux methods for setting up and installing from a third party software repository.

First we install the repository’s key:

wget -qO - https://packages.elastic.co/GPG-KEY-elasticsearch | apt-key add -

Then add the repository:

echo "deb https://artifacts.elastic.co/packages/6.x/apt stable main" | tee -a /etc/apt/sources.list.d/elastic-6.x.list

We need the HTTPS transport package installing before we can proceed to install.

apt-get update
apt-get install elasticsearch

Elasticsearch will work right out of the box which is fine for the purposes of this demonstration and (after we start the service) will be listening on localhost only on ports 9200 for the REST API and 9300 for node communication.

Install Kibana

This is installed from the Elasticstack repository added above:

apt-get install kibana

Again Kibana doesn’t require any reconfiguration for the purposes of this demonstration and will listen on localhost only on port 5601.

Start Kibana by running:

service kibana start

Now that Kibana is running start Elasticsearch by running:

service elasticsearch start

It's worth noting at this point that modern Debian distributions use systemd, but this doesn't work in non-privileged Docker containers. For reference the systemd equivalents for this are:

systemctl daemon-reload
systemctl enable kibana
systemctl enable elasticsearch
systemctl start kibana
systemctl start elasticsearch

These commands also ensure the services start on boot.

As Kibana is only listening on localhost and therefore unreachable from an external web browser, we will set up Nginx as a reverse proxy. This is a more secure configuration and recommended for any production implementation because only Nginx is directly exposed to the internet and not Kibana.

Install Nginx as a reverse proxy

Start by installing Nginx:

apt-get install nginx

Edit the file /etc/nginx/sites-available/default and in the location / section add the line:

proxy_pass http://localhost:5601;

Comment out the line beginning with try_files.

It should look something like this:

    location / {
        # First attempt to serve request as file, then
        # as directory, then fall back to displaying a 404.
        proxy_pass http://localhost:5601;
        #try_files $uri $uri/ =404;
    }

Don’t forget the semi-colon!

Start nginx:

service nginx start

Install Logstash

An installation procedure that by now should look familiar:

apt-get install logstash

Our Logstash configuration will be in two parts:

  1. A standard out of the box configuration for receiving files from Filebeat and sending them to Elasticsearch: for this copy /etc/logstash/logstash-sample.conf to /etc/logstash/conf.d/logstash.conf
  2. A custom config file, /etc/logstash/conf.d/informix.conf, for parsing the Informix online log.

I intend to update and improve the Logstash config for filtering the Informix online log and it’s available on my Github page at https://github.com/skybet/informix-helpers/blob/master/logstash/informix.conf. Download it locally and then you can copy it to your Docker container as follows:

docker cp informix.conf elasticstack_monitoring:/etc/logstash/conf.d/informix.conf

The Informix config file requires that Filebeat tags the file with "[fields][informix] = true"; this condition is trivial to remove if you wish.

Check the Logstash configuration with:

/usr/share/logstash/bin/logstash -t --path.settings /etc/logstash

Finally start Logstash with:

/usr/share/logstash/bin/logstash --path.settings /etc/logstash

You could also use systemd to do this.

Informix installation

Informix Developer Edition Docker container

Now we are going to set up the Informix container to monitor. On your workstation in another terminal run:

docker pull ibmcom/informix-developer-database
docker run -it --name iif_developer_edition --privileged -p 9088:9088 -p 9089:9089 -p 27017:27017 -p 27018:27018 -p 27883:27883 --hostname ifxserver --net my_informix_elk_stack -e LICENSE=accept ibmcom/informix-developer-database:latest

It’s worth noting that if you exit the shell the Informix DE Docker container will stop. You can start it again with:

docker start iif_developer_edition -i

This latest version containing 12.10.FC12W1 doesn’t return you to the prompt after the engine starts, so you’ll need to open an interactive container in another terminal window as follows:

docker exec -it iif_developer_edition /bin/bash

Now both Docker containers are running you should be able to test name resolution and connectivity both ways with ping.

From the Informix container:

informix@ifxserver:/$ ping elasticstack_monitoring

From the Elastic Stack container:

root@elasticstack2:/# ping iif_developer_edition

These names belong to the containers and are not necessarily their host names.

While you’re still logged in as Informix set MSG_DATE to 1 which is required for my Logstash configuration:

onmode -wf MSG_DATE=1

This means (nearly) all online log messages will be prefixed with the date (MM/DD/YY format) and time (HH:MM:SS format).

You’ll be logged in as user informix which can sudo to root as follows:
sudo -i

Install Filebeat

In the Informix container it’s more of the same to install Filebeat:

apt-get update
apt-get install vim wget apt-transport-https
wget -qO - https://packages.elastic.co/GPG-KEY-elasticsearch | apt-key add -
echo "deb https://artifacts.elastic.co/packages/6.x/apt stable main" | tee -a /etc/apt/sources.list.d/elastic-6.x.list
apt-get update
apt-get install filebeat

Filebeat’s configuration file is /etc/filebeat/filebeat.yml. If you’re not familiar with yml files correct indentation with two spaces per level is absolutely essential; yml files rely on this for structure instead of any braces or brackets. Lines beginning with a dash indicate an array.

onstat -c | grep MSGPATH reveals that the Informix online log file resides at /opt/ibm/data/logs/online.log and we want Filebeat to ingest this and pass it to Logstash running in the other container for parsing.

The Informix online log quite often contains wrapped lines and these generally don’t start with a timestamp or date of any kind.

Edit the file and add the following configuration directly under filebeat.inputs:

- type: log
  enabled: true
  paths:
    - /opt/ibm/data/logs/online.log
  fields:
    informix: true
  multiline.pattern: '^[0-9][0-9]'
  multiline.negate: true
  multiline.match: after

Finally set up the output by commenting out (adding a '#' to) all parts of the output.elasticsearch section and uncommenting the output.logstash section. Set hosts in this section to ["elasticstack_monitoring:5044"].

Start filebeat by running:

service filebeat start

You should see the message: Config OK.

Using Kibana

You should now be able to log into Kibana on http://localhost/ and add an index pattern for filebeat*.

Then in the Discover section you should be able to see Informix log lines coming into the system and how they have been tagged.

If you wish to improve and test the parsing of particular log entries it is simple enough to create ones yourself in the Informix container like this:

echo "12/08/18 17:00:00 My Informix log line" >> /opt/ibm/data/logs/online.log

This particular blog post is going to end shortly. Kibana is a big subject and at this stage of my Elasticstack DevOps journey, I don’t feel qualified to write a blog on the subject. Once I’ve progressed further I may do a follow-on post.

The tagging of events in the online log, like the checkpoint duration and its assignment to the variable informix.ckpt_duration, should allow you to easily do searches based on this and visualise them in dashboards.

Good luck!


Forest of trees

Introduction

Forest of trees (FOT) indices were a new feature in 11.70 and are designed to avoid multiple sqlexec threads competing for access to the same index root node (top level of the index) and wasting CPU cycles. When a thread cannot access the root node because it is latched by another thread, it will spin and re-try. This informs the approach taken by the Informix official documentation to finding where a FOT will be beneficial: checking for database objects (that happen to be indices) with the most spin locks.

You can create a FOT like this:
CREATE INDEX fotidx ON tab(c1) hash on (c1) with 100 buckets;

The idea behind FOTs is to split the index into many "buckets", each with its own root node, improving concurrency by reducing competition between threads. A hash function on a specified column is used to determine in which bucket to retrieve or place a row, which in most cases can only practicably be a hash on the leading column of the index. From this it follows that the leading column needs to have a lot of distinct values and not be skewed, otherwise the hash function will return the same value frequently, negating the potential benefits. Certainly it is pointless to have more buckets than you have distinct values on the hashed column and, unless you want a set of index buckets each consisting of a single value, you'll want the number of buckets to be a fraction of the distinct values, since different column values can of course hash to the same bucket.
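A hedged sanity check before settling on a bucket count (the table and column names follow the create index example above and are placeholders):

-- How many distinct values does the candidate hash column have?
-- The bucket count should be a fraction of this.
SELECT COUNT(DISTINCT c1) AS distinct_values, COUNT(*) AS total_rows
FROM tab;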

If you’ve so far ignored this feature do you need it? What can be the benefits of implementing a FOT given there are some potential drawbacks such as not being able to do range scans on the hashed column? What other symptoms are there? And is the implementation as good as it could be?

Do I need one?

As the problem FOTs attempt to solve is competition between threads for access to the same index root nodes, it follows that you are unlikely to ever need one on systems with one or only a few CPU virtual processors (CPU VPs). Without multiple CPU VPs it is impossible for multiple sqlexec threads to be accessing the same buffers at the same time. It also follows that you need a fair number of database connections and for these to be doing similar things. By similar I don't mean exactly the same, because then the threads would still compete for the same root node. A good example of where a FOT would work well is where multiple sessions select all the data for a customer by customer id: the same query is being run in many sessions but the customer id is different. It's worth mentioning that this contention occurs purely in memory on the database server; the performance problem the feature addresses is not related to fetching rows from storage.

What are the benefits?

At the instance level, a large number of spin locks manifests itself as high CPU usage without the database server doing much useful work. Of course there can be many causes of high CPU usage such as poor SQL, poor query plans, slow disk etc. but a feature of systems suffering this kind of contention is that they don’t scale linearly. If an index is not very busy the FOT is not giving you any benefits over a standard B-Tree index but as load ramps up and an index is used more intensively by multiple concurrent threads, spin waits with a standard B-Tree can increase dramatically consuming CPU cycles for no real gain.

What other symptoms are there of spin waits caused by index root node contention?

  1. CPU load suddenly jumping up when load hits a certain level.
  2. A large number of buffer waits (‘onstat -X‘) on the index concerned.
  3. Many running sqlexec threads are all running the same SQL statements. You may see these threads in this state:
    tid name rstcb flags curstk status
    5362439 sqlexec 1f519b358 ---PR-- 11888 running-

Is the implementation as good as it could be?

There are a couple of missing features in the implementation, both around the creation of the index itself rather than accessing or updating it:

  • While you can allocate more memory to the session building the index by setting PDQPRIORITY, building a FOT is always a single-threaded operation, meaning it can be slow with large tables.
  • Rebuilding an index as a FOT via "alter fragment on index" is not supported, meaning you can't rebuild the index and leave the logical schema intact. Instead you must drop and recreate the index. This isn't too painful if the index is there for performance reasons only, but if it's supporting constraints you need to drop and recreate these too and, if it's a primary key, any referencing foreign keys will be implicitly dropped. You can find any foreign keys referencing a given table with:
    SELECT
    t.tabname,
    c.constrname
    FROM
    sysconstraints c,
    systables t
    WHERE
    c.tabid=t.tabid AND
    c.constrid IN (
    SELECT
    constrid
    FROM
    sysreferences
    WHERE
    ptabid IN (
    SELECT
    tabid
    FROM
    systables
    WHERE
    tabname='XXXXXX'
    )
    );

    Then use dbschema to fetch the SQL to recreate the foreign keys.

If you think these improvements to the implementation would be beneficial to you, there is an IBM RFE you can vote for.

Summary

This is a feature that you may never need, but one aimed at solving a specific type of performance problem very effectively.


Extent size doubling

This post was inspired by a discussion on one of the IIUG Technical Special Interest Groups, titled Outrageous Extent sizes.

When you write to an Informix table or index (referred to collectively as an "object") the Informix engine writes to an extent, a contiguous allocation of disk of a fixed size dedicated to that table or index. Informix has a limit on the number of extents an object may have which depends on the page size: around 200 extents for a 2 kB dbspace, rising to around 2000 for a 16 kB dbspace.
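If you want to see which objects are closest to the limit, a query along these lines should do it. This is a hedged sketch: the sysmaster:sysextents columns are as I recall them, so check them on your version.

-- One row per extent, so the count per object is its number of extents.
SELECT dbsname, tabname, COUNT(*) AS extents
FROM sysmaster:sysextents
GROUP BY dbsname, tabname
ORDER BY 3 DESC;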

To ensure space is not wasted, new extents start small and then the engine increases the space it allocates as the table grows.

In versions of Informix that are now unsupported, the algorithm was to allocate four extents of the initial extent size and then start using the next extent size, which doubled every sixteen allocations. Unfortunately this algorithm was slow to “get going” and objects, especially those in 2 kB dbspaces, could quickly gain a lot of extents and approach the extent limit.

To avoid this problem the DBA could modify the first and next extent sizes to suit, which was useful especially if you could anticipate future growth.

In recent versions of Informix the algorithm for allocating extents to tables has changed and become more aggressive, the result being less intervention required by a DBA but perhaps unwanted surprises when large amounts of unallocated space are assigned to objects whenever a new extent is created.

The Informix manual states with regard to extent size doubling:

For permanent tables or user-defined temporary tables, the size of the next extent for every allocation is automatically doubled. The size doubles up to 128 kilobytes (KB). For example, if you create a table with the NEXT SIZE equal to 15 KB, the database server allocates the first extent at a size of 15 KB. The next extent is allocated at 30 KB, and the extent after that is allocated at 60 KB. When the extent size reaches 128 KB, the size is doubled only when the remaining space in the table is less than 10% of the total allocated space in the table.

For system-created temporary tables, the next-extent size begins to double after 4 extents have been added.

I found this explanation kind of confusing and – as we’ll see – it’s slightly wrong anyway.

As ever I am going to use Perl, a DBA favourite, to investigate what is happening. I am going to create a simple table as follows:

create table t1 (
myid bigserial not null,
mydata varchar(254) not null
) in my4kspace extent size 16 next size 16 ;

create unique index ui_t1 on t1 (myid) in my4kspace;

Then what I am going to do is insert random data one row at a time into the mydata column and, after every insert, check the next extent size used by referencing the partnum in sysmaster:sysptnhdr. If a new extent is added or the next extent size changes I’m going to print a one-line summary and do this for both the table and its index.

It’s worth mentioning that sysptnhdr records the actual extent sizes being used, unlike systables, which contains the value last set using SQL or when the table was created.
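
For reference, the per-insert check amounts to something like the following for the table (a sketch, assuming the nextsiz, nptotal and nrows columns of sysmaster:sysptnhdr, with sizes in pages; the index partnum has to be looked up separately):

-- next extent size and total allocated pages for table t1
SELECT p.nextsiz, p.nptotal, p.nrows
FROM sysmaster:sysptnhdr p
WHERE p.partnum = (SELECT partnum FROM systables WHERE tabname = 't1');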

Here are my results:

TABLE: Next size: 16 kB (4 pages); nptotal: 4 (16 kB); nrows: 1
INDEX: Next size: 16 kB (4 pages); nptotal: 4 (16 kB)
TABLE: Next size: 32 kB (8 pages); nptotal: 8 (32 kB); nrows: 98; ratio: 100.00 (16 kB -> 32 kB)
TABLE: Next size: 64 kB (16 pages); nptotal: 16 (64 kB); nrows: 207; ratio: 100.00 (32 kB -> 64 kB)
TABLE: Next size: 128 kB (32 pages); nptotal: 20 (80 kB); nrows: 424; ratio: 100.00 (64 kB -> 128 kB)
INDEX: Next size: 32 kB (8 pages); nptotal: 8 (32 kB); ratio: 100.00 (16 kB -> 32 kB)
TABLE: Next size: 256 kB (64 pages); nptotal: 52 (208 kB); nrows: 529; ratio: 160.00 (128 kB -> 256 kB)
TABLE: Next size: 512 kB (128 pages); nptotal: 116 (464 kB); nrows: 1402; ratio: 123.08 (256 kB -> 512 kB)
INDEX: Next size: 64 kB (16 pages); nptotal: 16 (64 kB); ratio: 100.00 (32 kB -> 64 kB)
TABLE: Next size: 512 kB (128 pages); nptotal: 244 (976 kB); nrows: 3157; ratio: 110.34
INDEX: Next size: 128 kB (32 pages); nptotal: 32 (128 kB); ratio: 100.00 (64 kB -> 128 kB)
TABLE: Next size: 512 kB (128 pages); nptotal: 372 (1488 kB); nrows: 6636; ratio: 52.46
INDEX: Next size: 256 kB (64 pages); nptotal: 64 (256 kB); ratio: 100.00 (128 kB -> 256 kB)
TABLE: Next size: 512 kB (128 pages); nptotal: 500 (2000 kB); nrows: 10162; ratio: 34.41
TABLE: Next size: 512 kB (128 pages); nptotal: 628 (2512 kB); nrows: 13697; ratio: 25.60
INDEX: Next size: 512 kB (128 pages); nptotal: 128 (512 kB); ratio: 100.00 (256 kB -> 512 kB)
TABLE: Next size: 512 kB (128 pages); nptotal: 756 (3024 kB); nrows: 17255; ratio: 20.38
TABLE: Next size: 512 kB (128 pages); nptotal: 884 (3536 kB); nrows: 20741; ratio: 16.93
TABLE: Next size: 512 kB (128 pages); nptotal: 1012 (4048 kB); nrows: 24332; ratio: 14.48
TABLE: Next size: 512 kB (128 pages); nptotal: 1140 (4560 kB); nrows: 27883; ratio: 12.65
INDEX: Next size: 512 kB (128 pages); nptotal: 256 (1024 kB); ratio: 100.00
TABLE: Next size: 512 kB (128 pages); nptotal: 1268 (5072 kB); nrows: 31463; ratio: 11.23
TABLE: Next size: 512 kB (128 pages); nptotal: 1396 (5584 kB); nrows: 34921; ratio: 10.09
TABLE: Next size: 1024 kB (256 pages); nptotal: 1524 (6096 kB); nrows: 38471; ratio: 9.17 (512 kB -> 1024 kB)
TABLE: Next size: 1024 kB (256 pages); nptotal: 1780 (7120 kB); nrows: 41965; ratio: 16.80
TABLE: Next size: 1024 kB (256 pages); nptotal: 2036 (8144 kB); nrows: 49007; ratio: 14.38
TABLE: Next size: 1024 kB (256 pages); nptotal: 2292 (9168 kB); nrows: 56089; ratio: 12.57
INDEX: Next size: 512 kB (128 pages); nptotal: 384 (1536 kB); ratio: 50.00
TABLE: Next size: 1024 kB (256 pages); nptotal: 2548 (10192 kB); nrows: 63151; ratio: 11.17
TABLE: Next size: 1024 kB (256 pages); nptotal: 2804 (11216 kB); nrows: 70325; ratio: 10.05
TABLE: Next size: 2048 kB (512 pages); nptotal: 3060 (12240 kB); nrows: 77402; ratio: 9.13 (1024 kB -> 2048 kB)
TABLE: Next size: 2048 kB (512 pages); nptotal: 3572 (14288 kB); nrows: 84473; ratio: 16.73
INDEX: Next size: 512 kB (128 pages); nptotal: 512 (2048 kB); ratio: 33.33
TABLE: Next size: 2048 kB (512 pages); nptotal: 4084 (16336 kB); nrows: 98674; ratio: 14.33
TABLE: Next size: 2048 kB (512 pages); nptotal: 4596 (18384 kB); nrows: 112742; ratio: 12.54
INDEX: Next size: 512 kB (128 pages); nptotal: 640 (2560 kB); ratio: 25.00
TABLE: Next size: 2048 kB (512 pages); nptotal: 5108 (20432 kB); nrows: 126934; ratio: 11.14
TABLE: Next size: 2048 kB (512 pages); nptotal: 5620 (22480 kB); nrows: 141188; ratio: 10.02
INDEX: Next size: 512 kB (128 pages); nptotal: 768 (3072 kB); ratio: 20.00
TABLE: Next size: 4096 kB (1024 pages); nptotal: 6132 (24528 kB); nrows: 155312; ratio: 9.11 (2048 kB -> 4096 kB)
TABLE: Next size: 4096 kB (1024 pages); nptotal: 7156 (28624 kB); nrows: 169379; ratio: 16.70
INDEX: Next size: 512 kB (128 pages); nptotal: 896 (3584 kB); ratio: 16.67
TABLE: Next size: 4096 kB (1024 pages); nptotal: 8180 (32720 kB); nrows: 197862; ratio: 14.31
INDEX: Next size: 512 kB (128 pages); nptotal: 1024 (4096 kB); ratio: 14.29
TABLE: Next size: 4096 kB (1024 pages); nptotal: 9204 (36816 kB); nrows: 226153; ratio: 12.52
INDEX: Next size: 512 kB (128 pages); nptotal: 1152 (4608 kB); ratio: 12.50
TABLE: Next size: 4096 kB (1024 pages); nptotal: 10228 (40912 kB); nrows: 254610; ratio: 11.13
INDEX: Next size: 512 kB (128 pages); nptotal: 1280 (5120 kB); ratio: 11.11
TABLE: Next size: 4096 kB (1024 pages); nptotal: 11252 (45008 kB); nrows: 282889; ratio: 10.01
INDEX: Next size: 512 kB (128 pages); nptotal: 1408 (5632 kB); ratio: 10.00
TABLE: Next size: 8192 kB (2048 pages); nptotal: 12276 (49104 kB); nrows: 311209; ratio: 9.10 (4096 kB -> 8192 kB)
INDEX: Next size: 1024 kB (256 pages); nptotal: 1536 (6144 kB); ratio: 9.09 (512 kB -> 1024 kB)
TABLE: Next size: 8192 kB (2048 pages); nptotal: 14324 (57296 kB); nrows: 339501; ratio: 16.68
INDEX: Next size: 1024 kB (256 pages); nptotal: 1543 (6172 kB); ratio: 16.67
INDEX: Next size: 1024 kB (256 pages); nptotal: 1799 (7196 kB); ratio: 16.59

What I observe here for this table in a 4 kB space with an index also in a 4 kB space is:

  • The initial extent is created using the first extent size.
  • The second extent is created using the next extent size.
  • The next extent size doubles every time the table needs more space, up to 128 pages (not 128 kB as stated in the manual).
  • The extent size then doubles if the next extent size is less than 10% of the size of the object.
  • Identical logic applies to both tables and indices.

It’s easy to get confused between kB and pages here which is why I am showing both.
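
If you want to see where existing objects stand relative to the 10% threshold, a query along these lines can help (again a sketch, assuming the nextsiz and nptotal columns of sysmaster:sysptnhdr, both in pages, and the partnum-to-name mapping in sysmaster:systabnames):

-- next extent size as a percentage of the pages already allocated to each object
SELECT n.dbsname, n.tabname, p.nptotal, p.nextsiz,
ROUND((p.nextsiz * 100) / p.nptotal, 2) AS pct_of_object
FROM sysmaster:sysptnhdr p, sysmaster:systabnames n
WHERE p.partnum = n.partnum
ORDER BY 5;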

Just to show that the 128-page threshold also applies to other page sizes, here is the same output using an 8 kB dbspace:

TABLE: Next size: 32 kB (4 pages); nptotal: 4 (32 kB); nrows: 1
INDEX: Next size: 32 kB (4 pages); nptotal: 4 (32 kB)
TABLE: Next size: 64 kB (8 pages); nptotal: 8 (64 kB); nrows: 172; ratio: 100.00 (32 kB -> 64 kB)
TABLE: Next size: 128 kB (16 pages); nptotal: 16 (128 kB); nrows: 410; ratio: 100.00 (64 kB -> 128 kB)
TABLE: Next size: 256 kB (32 pages); nptotal: 32 (256 kB); nrows: 877; ratio: 100.00 (128 kB -> 256 kB)
INDEX: Next size: 64 kB (8 pages); nptotal: 8 (64 kB); ratio: 100.00 (32 kB -> 64 kB)
TABLE: Next size: 512 kB (64 pages); nptotal: 64 (512 kB); nrows: 1776; ratio: 100.00 (256 kB -> 512 kB)
INDEX: Next size: 128 kB (16 pages); nptotal: 16 (128 kB); ratio: 100.00 (64 kB -> 128 kB)
TABLE: Next size: 1024 kB (128 pages); nptotal: 128 (1024 kB); nrows: 3570; ratio: 100.00 (512 kB -> 1024 kB)
INDEX: Next size: 256 kB (32 pages); nptotal: 32 (256 kB); ratio: 100.00 (128 kB -> 256 kB)
TABLE: Next size: 1024 kB (128 pages); nptotal: 256 (2048 kB); nrows: 7214; ratio: 100.00
INDEX: Next size: 512 kB (64 pages); nptotal: 64 (512 kB); ratio: 100.00 (256 kB -> 512 kB)
TABLE: Next size: 1024 kB (128 pages); nptotal: 384 (3072 kB); nrows: 14466; ratio: 50.00
TABLE: Next size: 1024 kB (128 pages); nptotal: 512 (4096 kB); nrows: 21756; ratio: 33.33
TABLE: Next size: 1024 kB (128 pages); nptotal: 640 (5120 kB); nrows: 29041; ratio: 25.00
INDEX: Next size: 1024 kB (128 pages); nptotal: 128 (1024 kB); ratio: 100.00 (512 kB -> 1024 kB)
TABLE: Next size: 1024 kB (128 pages); nptotal: 768 (6144 kB); nrows: 36175; ratio: 20.00
TABLE: Next size: 1024 kB (128 pages); nptotal: 896 (7168 kB); nrows: 43515; ratio: 16.67
TABLE: Next size: 1024 kB (128 pages); nptotal: 1024 (8192 kB); nrows: 50751; ratio: 14.29
TABLE: Next size: 1024 kB (128 pages); nptotal: 1152 (9216 kB); nrows: 58019; ratio: 12.50
INDEX: Next size: 1024 kB (128 pages); nptotal: 256 (2048 kB); ratio: 100.00
TABLE: Next size: 1024 kB (128 pages); nptotal: 1280 (10240 kB); nrows: 65372; ratio: 11.11
TABLE: Next size: 1024 kB (128 pages); nptotal: 1408 (11264 kB); nrows: 72545; ratio: 10.00
TABLE: Next size: 2048 kB (256 pages); nptotal: 1536 (12288 kB); nrows: 79871; ratio: 9.09 (1024 kB -> 2048 kB)
TABLE: Next size: 2048 kB (256 pages); nptotal: 1792 (14336 kB); nrows: 87169; ratio: 16.67
TABLE: Next size: 2048 kB (256 pages); nptotal: 2048 (16384 kB); nrows: 101742; ratio: 14.29
TABLE: Next size: 2048 kB (256 pages); nptotal: 2304 (18432 kB); nrows: 116352; ratio: 12.50
INDEX: Next size: 1024 kB (128 pages); nptotal: 278 (2224 kB); ratio: 50.00
TABLE: Next size: 2048 kB (256 pages); nptotal: 2560 (20480 kB); nrows: 130862; ratio: 11.11
INDEX: Next size: 1024 kB (128 pages); nptotal: 406 (3248 kB); ratio: 46.04
TABLE: Next size: 2048 kB (256 pages); nptotal: 2816 (22528 kB); nrows: 145461; ratio: 10.00
TABLE: Next size: 4096 kB (512 pages); nptotal: 3072 (24576 kB); nrows: 160073; ratio: 9.09 (2048 kB -> 4096 kB)
TABLE: Next size: 4096 kB (512 pages); nptotal: 3584 (28672 kB); nrows: 174651; ratio: 16.67
INDEX: Next size: 1024 kB (128 pages); nptotal: 534 (4272 kB); ratio: 31.53
TABLE: Next size: 4096 kB (512 pages); nptotal: 4096 (32768 kB); nrows: 203750; ratio: 14.29
TABLE: Next size: 4096 kB (512 pages); nptotal: 4608 (36864 kB); nrows: 232818; ratio: 12.50
INDEX: Next size: 1024 kB (128 pages); nptotal: 662 (5296 kB); ratio: 23.97
TABLE: Next size: 4096 kB (512 pages); nptotal: 5120 (40960 kB); nrows: 261920; ratio: 11.11
TABLE: Next size: 4096 kB (512 pages); nptotal: 5632 (45056 kB); nrows: 290967; ratio: 10.00
INDEX: Next size: 1024 kB (128 pages); nptotal: 790 (6320 kB); ratio: 19.34
TABLE: Next size: 8192 kB (1024 pages); nptotal: 6144 (49152 kB); nrows: 320100; ratio: 9.09 (4096 kB -> 8192 kB)
TABLE: Next size: 8192 kB (1024 pages); nptotal: 7168 (57344 kB); nrows: 349442; ratio: 16.67
INDEX: Next size: 1024 kB (128 pages); nptotal: 918 (7344 kB); ratio: 16.20
TABLE: Next size: 8192 kB (1024 pages); nptotal: 8192 (65536 kB); nrows: 407578; ratio: 14.29
INDEX: Next size: 1024 kB (128 pages); nptotal: 1046 (8368 kB); ratio: 13.94
TABLE: Next size: 8192 kB (1024 pages); nptotal: 9216 (73728 kB); nrows: 465592; ratio: 12.50
INDEX: Next size: 1024 kB (128 pages); nptotal: 1138 (9104 kB); ratio: 12.24
TABLE: Next size: 8192 kB (1024 pages); nptotal: 10240 (81920 kB); nrows: 523746; ratio: 11.11
INDEX: Next size: 1024 kB (128 pages); nptotal: 1266 (10128 kB); ratio: 11.25
TABLE: Next size: 8192 kB (1024 pages); nptotal: 11264 (90112 kB); nrows: 581740; ratio: 10.00
INDEX: Next size: 1024 kB (128 pages); nptotal: 1394 (11152 kB); ratio: 10.11
TABLE: Next size: 16384 kB (2048 pages); nptotal: 12288 (98304 kB); nrows: 639785; ratio: 9.09 (8192 kB -> 16384 kB)
INDEX: Next size: 2048 kB (256 pages); nptotal: 1522 (12176 kB); ratio: 9.18 (1024 kB -> 2048 kB)
TABLE: Next size: 16384 kB (2048 pages); nptotal: 14336 (114688 kB); nrows: 697932; ratio: 16.67

Another feature of the engine is merging adjacent extents for the same object. This excerpt from an oncheck -pe report is from the 8 kB example:

Description                                                     Offset(p)  Size(p)
-------------------------------------------------------------  ---------  -------
dbname:'benthompson'.t1                                            187897    14336
dbname:'benthompson'.ui_t1                                         221782      278
dbname:'benthompson'.ui_t1                                         223004      860
dbname:'benthompson'.ui_t1                                         234148      384

In my example the engine has managed to merge all the extents for the table into one and has merged the index extents into three (278 + 860 + 384 = 1522).

If you want to try out the test yourself, the code is in the skybet/informix-helpers GitHub repository.


Monitoring Informix with Grafana

Introduction

In a presentation I gave at IIUG 2017 titled Making system monitoring better I showed, without much detail, how Grafana is a powerful tool for visualising what is happening within your Informix server. Performance metrics from your database server are collected at regular (usually 10 second) intervals and stored in a time-series database which can be used as the source for dashboards containing dynamic graphs and other ways of presenting the data. For the DBA the benefits are legion:

  • Quickly change the time range to zoom into when problems occurred or zoom out to see trends.
  • Correlate various database metrics and combine them with related operating system, network, storage or application metrics.
  • Get a truer picture of your busy periods, capacity, the effect of scheduled jobs etc.
  • Faster problem resolution because a lot of data can be visualised at once.

You might be able to think of some others.

The talk also touched on the CAMS acronym:

Culture
Automation
Measurement
Sharing

So you shouldn’t keep your dashboards to yourself: share them with other technical teams or everyone in your company. This has the added benefit of putting more eyes on the data, and others can learn to spot database problems, or recognise when a problem is probably not a database problem, by referring to these dashboards.

Why Grafana?

There are a number of tools which appear to do a similar job, among them Brunia, AGS Sentinel, Prometheus, Graphite and Grafana.

You perhaps haven’t heard of Brunia: it is the code name for a prototype monitoring tool that may replace Informix Open Admin Tool (OAT) in the future. It was demonstrated at IIUG 2017 and is probably closest to Prometheus in its execution. AGS Sentinel is the monitoring add-on to the popular ServerStudio suite for Informix. The rest are popular open source tools which other teams in your organisation are probably already using.

Some of the tools listed above can also produce events or alerts when a trigger condition occurs and automatically pass this up a stack to PagerDuty or another call-out system. An example of such a stack is Prometheus -> Alertmanager -> PagerDuty -> StatusPage

There are a lot of ways of implementing a full monitoring stack with choices to make about data collection, storing, visualisation, analysis and alerting. In this blog post I am going to concentrate on a simple set up where we collect Informix metrics, pass them to InfluxDB using a REST API and visualise in Grafana. For a fuller discussion of the benefits of the three open source technologies mentioned above I highly recommend reading this blog post from Loom Systems written in June 2017, Prometheus vs. Grafana vs. Graphite – A Feature Comparison.

In case you’re not going to read the LS blog it’s worth emphasising what the InfluxDB/Grafana solution I am about to describe does not provide:

  • There is little in the way of monitoring and alerting features.
  • Regression models enabling you to predict the value of a metric in the future are not available.
  • Advanced time series functions are not available.

The solution would be richer if Graphite were used as the data source and Grafana for visualisation only. This would provide more aggregation functions and allow you to do things like subtract one time series from another. As an example of what this might provide, I have a dashboard (not covered in this blog post) displaying the buffer turnover ratio and buffer waits ratio over an arbitrary moving window irrespective of when onstat -z was last run.

It is easy to confuse Graphite and Grafana, especially as both can be used independently, or Graphite can be a data source for Grafana.

As this is an Informix blog I ought to explain why I am using InfluxDB and not Informix time series. The simple answer is that to use Informix time series with Grafana properly someone would have to write and maintain a data source plugin for it like the one for InfluxDB we’ll be using. Doing so would give something more feature rich than InfluxDB for sure but perhaps not much more powerful than a Graphite/Grafana combination.

What to monitor

Potentially anything we can put a value to every ten seconds can be collected and stored in InfluxDB (which is a statement you can make about time series collections in general). For Linux operating system metrics there is a well-established collection daemon called collectd and, if I had better C programming skills, I could write a collectd plugin for Informix.

For Informix systems the most obvious source is the system monitoring interface (SMI) which is the presentation of information held in shared memory through pseudo-tables in the sysmaster database. This covers the vast majority of what can be collected using onstat but is easier to handle in a programming language. Doing it this way means we can also collect real table data in the same manner.

For example the system profile or onstat -p can be captured with the following SQL:

SELECT
TRIM(name) AS name,
value
FROM
sysmaster:sysprofile
WHERE
name IN ('dskreads', 'bufreads', 'dskwrites', 'bufwrites', 'isamtot', 'isopens', 'isstarts', 'isreads', 'iswrites', 'isrewrites', 'isdeletes', 'iscommits', 'isrollbacks', 'latchwts', 'buffwts', 'lockreqs', 'lockwts', 'ckptwts', 'deadlks', 'lktouts', 'numckpts', 'plgpagewrites', 'plgwrites', 'llgrecs', 'llgpagewrites', 'llgwrites', 'pagreads', 'pagwrites', 'flushes', 'compress', 'fgwrites', 'lruwrites', 'chunkwrites', 'btraidx', 'dpra', 'rapgs_used', 'seqscans', 'totalsorts', 'memsorts', 'disksorts', 'maxsortspace')

It’s important to understand that all of these metrics are what I’d term counters. That is, they only increase over time (unless they get so large they run out of bits and wrap, or a DBA runs onstat -z). It gets difficult to see on a graph the difference between, say, 2394472 and 2394483, so it’s useful to calculate a delta over the ten-second window. Some things you might collect are automatically suitable for graphing because they are gauges: an example of this is the number of threads in your ready queue at any given moment.
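
As a contrast to the counters above, here is an example of a gauge you could collect with the same mechanism (a sketch: it simply counts the rows in sysmaster:syssessions at the moment it runs, giving the number of connected sessions, which can be graphed directly without any delta calculation):

-- number of sessions connected right now: a gauge, not a counter
SELECT COUNT(*) AS sessions
FROM sysmaster:syssessions;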

Implementation

Practical demonstration with Docker containers

Nothing better than an example you can try at home (or work!). In the implementation example I will be using the IBM Informix Developer Edition Docker container which, at time of writing, runs Debian 8 (jessie) and a second Docker container for InfluxDB and Grafana. You’ll of course need Docker installed on your laptop or workstation for this to work.

What this demonstration is going to build is straightforward: a collector script will gather metrics from Informix at a regular interval and post the results to InfluxDB, and you will be able to use your usual web browser to connect to Grafana and visualise the data. Sounds simple?

We’ll start by setting up the InfluxDB/Grafana Docker container, which will also be based on a (minimal) Debian installation. In a terminal run:

docker pull debian
docker run -it --name influx_grafana_monitoring -p 8086:8086 -p 3000:3000 --hostname grafserv -e "GF_SECURITY_ADMIN_PASSWORD=secret" debian

Your terminal should now be inside the Docker container and logged in as root. Run these commands to install some extra packages and then InfluxDB:

apt-get update
apt-get -y install curl gnupg apt-transport-https procps
curl -sL https://repos.influxdata.com/influxdb.key | apt-key add -
echo "deb https://repos.influxdata.com/debian jessie stable" | tee -a /etc/apt/sources.list
apt-get update
apt-get -y install influxdb

Next install Grafana in the container:

echo "deb https://packagecloud.io/grafana/stable/debian/ jessie main" | tee -a /etc/apt/sources.list
curl https://packagecloud.io/gpg.key | apt-key add -
apt-get update
apt-get -y install grafana

Start both services inside the container:

/etc/init.d/influxdb start
/etc/init.d/grafana-server start

We need to create an Influx database to store our time series data and we can do this with a REST API call:

curl -i -XPOST http://localhost:8086/query --data-urlencode "q=CREATE DATABASE informix"

If it works you should see an HTTP/1.1 200 OK response.

You should now be able to access the Grafana server running in your Docker container from your usual web browser at http://localhost:3000/

Log in with the user name admin and the password secret. Once logged in click Add data source and fill in the settings as follows (some of them are case-sensitive):

Name
informix
Type
InfluxDB

HTTP settings

URL
http://localhost:8086
Access
direct

HTTP Auth

Basic auth
Leave unticked
With credentials
Leave unticked

InfluxDB Details

Database
informix
User
Leave blank
Password
Leave blank
Min time interval
Leave at 10s

All being well you should see Data source is working in a big green box.

Now we are going to set up the Informix container to monitor. On your workstation in another terminal run:

docker pull ibmcom/informix-developer-database
docker run -it --name iif_developer_edition --privileged -p 9088:9088 -p 9089:9089 -p 27017:27017 -p 27018:27018 -p 27883:27883 --hostname ifxserver -e LICENSE=accept ibmcom/informix-developer-database:latest

The command above should provide a running Informix instance; this may take a few moments, after which control is passed back to the terminal. We are now going to get the scripts that will send data to InfluxDB:

sudo -i
apt-get update
apt-get -y install git libdbi-perl libjson-perl libwww-curl-perl make gcc libtest-pod-perl

We need to get the Perl DBD::Informix package from CPAN which will download, compile, test and install it for us.

. /home/informix/ifx_dev.env
export DBI_DBNAME=sysmaster
export DBD_INFORMIX_DATABASE=sysmaster
export DBD_INFORMIX_USERNAME=informix
export DBD_INFORMIX_PASSWORD=in4mix
cpan

Enter ‘yes’ to configure as much as possible. In the CPAN console type the case-sensitive command:

install DBD::Informix

There is quite a lot that could go wrong in the CPAN console but it should work if you’re using the IBM Informix DE Docker container and follow the steps exactly. If you’re installing on RedHat Linux or a derivative the required RPM package names you use with yum install will all be different.

Type logout to exit the root shell. You should be logged in as user informix again. Leave this session for later.

Run on your local workstation (outside both Docker containers) in another terminal window:

git clone https://github.com/skybet/informix-helpers.git

This assumes you have git installed. There are two important files in the post_to_influxdb directory:

  • informix_smi_influx_uploader
  • informix_smi.json

You will need to edit informix_smi.json and change all references from mydatabase to the name of the user/application database you want to monitor. For the purposes of this demo we are just going to monitor the sysmaster database, so change all mydatabase references to sysmaster.

You can copy the files to your Informix Docker container as follows. To get the name of your Informix Docker container (not its hostname) simply type docker ps on your workstation.

$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
5a9c73712429 debian "bash" 8 minutes ago Up 8 minutes 0.0.0.0:3000->3000/tcp, 0.0.0.0:8086->8086/tcp influx_grafana_monitoring
e1c178b57164 ibmcom/informix-developer-database:latest "/bin/bash informi..." 13 minutes ago Up 13 minutes 0.0.0.0:9088-9089->9088-9089/tcp, 0.0.0.0:27017-27018->27017-27018/tcp, 0.0.0.0:27883->27883/tcp iif_developer_edition

From the above my Informix container name is e1c178b57164

docker cp informix_smi_influx_uploader e1c178b57164:/home/informix
docker cp informix_smi.json e1c178b57164:/home/informix

We should be ready to start collecting metrics and posting them to InfluxDB. Run in the Informix container as user informix:

cd
./informix_smi_influx_uploader -c ./informix_smi.json -i 10 -u http://other_container_ip:8086/write?db=informix

Change other_container_ip to the IP address of your InfluxDB/Grafana container. You must use the IP address unless you have name resolution, which this basic Docker set up does not provide. If you don’t know the IP address you can find it by pinging the container’s hostname from inside the InfluxDB/Grafana container with a command like ping -c 1 grafserv

All being well the Perl script should run continuously and collect and post data to InfluxDB every 10 seconds or whatever interval you specified with the -i switch.

To see anything in Grafana we’ll need to set up a dashboard. The file informix_server.json in the grafana_dashboard directory describes a Grafana dashboard. You’ll need to edit it a bit first and change all occurrences of the following:

<%= @hostname %>
hostname of your Informix docker container, normally the unqualified hostname of your Informix server
<%= @informixserver %>
Name of your Informix instance, dev

In the Grafana web browser click the Grafana logo and then Dashboards. Click Import. At Import Dashboard click Upload .json File. Hey presto, you should have a dashboard with graphs. Some may not display any data, e.g. Temporary dbspace usage, because there are no temporary dbspaces in the development Docker image by default.

Making this ready for production

There are a few bits I’ll leave to you for any production implementation:

  • The collection shouldn’t run as user informix. Create a user for the monitoring and give it just the CONNECT and SELECT privileges it needs.
  • You’ll also need to write a script to start/stop the collection with the instance.
  • Linux operating system statistics, gathered through collectd would complement the dashboard very nicely.
  • You’ll probably want to customise both the JSON file containing the SMI queries and the one describing the dashboard. You could add application metrics that can be collected from the database or create multiple dashboards, especially if you don’t like the idea of one big one showing everything. Bear in mind any queries you write need to be fast and will run every 10 seconds.
  • In the dashboard you may need to add/remove different page sizes to/from the buffer, page and disk reads/writes graphs.

Conclusion

It can be very useful to track virtual segment and temp. space usage on your server and correlate with events like update stats, ontape/onbar backups or application activity. You can use other tools to do this but these often are not as accessible or are purely in the realm of the DBA. A Grafana dashboard like the one described here should be very useful for you and colleagues, especially if they have their own dashboards on the same system which allow you to view your systems as a whole, and it might go some distance to demystifying Informix in your organisation.


Passwords encrypted over the network: why is this feature not enabled by default?

In 2015 I wrote a blog post about using simple password encryption (SPW) and how – without it – your valuable passwords can be trivially sniffed on your network. If you look through the post it illustrates the vulnerability and shows just how easy it is to set your system up in a more secure way.

SPW only encrypts your password when you connect. Not everyone wants or needs full encryption of all their traffic but what reasons are there not to use SPW?

  • It requires a small amount of extra set-up work, although this can (should?) be automated.
  • It means your database engine spawns some extra cssmbox_cn threads, although they are only used at connection time and the overhead is low.
  • Consideration should be given to patching the IBM Global Security Kit (GSKit) separately from the server and client, both of which bundle it.

I don’t know of any other drawbacks. In my opinion these are nothing substantive when you consider your peace of mind.

If you have Fix Central access you can always download the latest GSKit from here. Although it’s used by many IBM products it’s filed under Tivoli which isn’t obvious at all.

Patching the GSKit separately isn’t necessarily something you need to do but it isn’t only used by SPW: if you’ve set ENCRYPT_HDR, ENCRYPT_SMX or ENCRYPT_CDR, for example, you are using it. The GSKit doesn’t get installed in INFORMIXDIR; it’s installed by RPM (on Linux) to /usr/local/ibm and only one version can exist on your server. So if you’re used to pre-installing a new version of Informix server or Client SDK in its own folder prior to an upgrade, be aware that you may just have unwittingly upgraded the GSKit.

The feature has suffered a few issues lately and is currently broken when used with the Informix JDBC driver in 11.70.xC9; connections supported by CSDK or IConnect work fine. I think the feature would be more dependable if more people used it (or if the product testing stress tested this area). Here are some relatively recent issues:

  • All recent JDBC drivers earlier than 4.10.JC8 (including 3.70.JC8W1) suffer from an issue where a small proportion of connections will fail. You might not notice this if your application can capture logon failures and retry automatically. There is no APAR for this that I know of as 4.10.JC8 was extensively reworked for JDBC 4.0 support.
  • Informix 11.70.xC9 contains fix IT10493 but this caused a high rate of logon failures with SPW and fix IT17087 is additionally needed but not included.
  • If you’re using the 12.10 code line you need xC8 or later to get the same fix.
  • CSDK 4.10.FC8 ships with an incompatible GSKit version, 8.0.50.66, but actually requires 8.0.50.70+ (APAR IT18763). You may not notice this, however, if your server software ships with a later version.

I hope this doesn’t come across as a moan, more a call to action.


When do my stored procedure execution plans get updated?

For the sake of brevity in this article I am going to group procedures, functions and routines together as stored procedures and ignore any differences between them.

What does the SQL command UPDATE STATISTICS FOR PROCEDURE/FUNCTION/ROUTINE do and, perhaps more pertinently, as a DBA do I need to run it regularly to ensure my systems are working efficiently? For those wanting an immediate answer, I think it is “never” or “almost never”, the reasons for which I hope to explain clearly in this article.

The command itself is straightforward: calling it causes Informix to parse a stored procedure’s code and produce a query plan for all of it based on the current statistics and data distributions (if the procedure references any tables). It then writes the query plan to the sysprocplan table which is, unless you have an unlogged database, a logged operation written to the logical log. Used with no parameters it does this for all stored procedures in the system.
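
For reference, the command itself looks like this (the routine name is illustrative):

-- regenerate and store the query plan for a single routine
UPDATE STATISTICS FOR PROCEDURE my_proc;
-- with no routine named, do the same for every stored procedure in the database
UPDATE STATISTICS FOR PROCEDURE;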

As long as the query plan in sysprocplan is reasonably efficient there is probably no need to (ever) proactively update it, but there may be cases when you’d want to do so, for example if a very small or empty table has grown into a large one. However, if you were to do this, your new plan would be based on the current table statistics and data distributions, and if these haven’t been updated yet you may get the same, now inefficient, plan.

The manual states:

The sysprocplan system catalog table stores execution plans for SPL routines. Two actions can update the sysprocplan system catalog table:

  • Execution of an SPL routine that uses a modified table
  • The UPDATE STATISTICS FOR ROUTINE, FUNCTION, or PROCEDURE statement.

There is a created column in the sysprocplan table but it’s a date and not a date/time which makes it much harder to match plan updates to other events.

So what is a modified table? Quite simply it is one where the version number has been incremented. You can see the version number with an SQL query like:

select version from systables where owner='myowner' and tabname='mytable';

I think the only reference to this in the manual is in the section about the systables view, where it simply says:

version
INTEGER
Number that changes when table is altered

How the engine works out the dependencies a stored procedure has on different tables falls into the category of system internals, which IBM chooses not to publicly document, but I think it’s safe to say that if a table is referenced anywhere in a procedure it is dependent on it.

There are many ways a table can be “altered”, some more obvious than others:

Method Version number incremented by
GRANT SELECT 1
GRANT UPDATE 1
GRANT SELECT 1
GRANT SELECT, UPDATE 1
UPDATE STATISTICS HIGH 2
UPDATE STATISTICS MEDIUM 2
UPDATE STATISTICS [LOW] 2
RENAME COLUMN 65536
CREATE INDEX 65536
ADD column 131072
DROP column 131072
GRANT CONNECT 0
GRANT RESOURCE 0
GRANT DBA 1

I am not sure why some operations increment the value by large numbers, all powers of 2, as any increment has a similar effect, at least as far as the scope of this article is concerned.

The table is not a complete list because there are many possible DDL operations but this does already illustrate or suggest that:

  • On most systems it’s likely that UPDATE STATISTICS commands will be the main trigger for stored query plans to be updated. If you run LOW, HIGH and MEDIUM modes for a table like you will if you use AUS or dostats, you’ll trigger at least three updates for dependent stored procedures (if they are called).
  • If we want to grant multiple privileges on the same table, it’s best to do it in a single statement so that, if a dependent stored procedure is called by an application in between, its stored execution plan is updated only once rather than once per GRANT statement.
  • GRANT DBA is not a table level operation yet it has an effect.

Further testing shows that both the GRANT DBA and REVOKE DBA statements increment the version number on all tables in the same database where the tabid is 100 or greater, that is all user tables. From the manual above it follows that the stored query plans for all stored procedures or functions dependent on a table will be updated the next time they are executed.
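
A quick way to see this effect for yourself is to note the version numbers of your user tables before and after granting or revoking DBA; the tabid filter below restricts the query to user tables as described above:

-- user tables have tabid 100 or greater; run before and after GRANT/REVOKE DBA
SELECT tabid, tabname, version
FROM systables
WHERE tabid >= 100
ORDER BY tabname;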

On our systems we see a large amount of writes to sysprocplan after granting or revoking the DBA privilege to anyone. When graphed we see a sharp peak and a long exponential tail off as less commonly used procedures get called.

Therefore if you grant DBA to a user on a busy live system, it can affect concurrency more than you might expect. On an idle system you may want to run UPDATE STATISTICS FOR PROCEDURE immediately afterwards to update the stored query plans in an orderly way and save the first session to call any given procedure from this overhead.

I think running the command offline to avoid the overhead for user or application sessions is possibly the only true use case for this command.


Improving remote query performance by tuning FET_BUF_SIZE

I thought I’d write this blog post as a nice example of where tuning the client-side variable, FET_BUF_SIZE, really speeded up a remote query.

FET_BUF_SIZE is documented by IBM in the context of a Java application using JDBC here and as a server environment variable here.

One thing the documentation warns about is that simply setting this to a high value may degrade performance, especially if you have a lot of connections. With that in mind here are some facts about the query I’m running and using as a basis for these tests:

  • I am just using a single connection to the database.
  • the query returns around 10000 rows and 60 MB of data.
  • the client and the server are geographically separated from each other and Art Kagel’s dbping utility typically takes around 0.1 seconds to connect remotely; this compares with around 3 milliseconds locally.
  • crucially the query runs in seconds locally on the server but takes over three minutes when run remotely.

If I begin running the query with the default value of FET_BUF_SIZE and monitor waits on the server, I can see that reads only go up slowly and that my session is waiting on a condition (indicated by the Y in position one of column two) more or less all the time:

> while [ 1 ] ; do
> onstat -u | grep thompson
> sleep 1
> done
Userthreads
address flags sessid user tty wait tout locks nreads nwrites
26eb492d18 Y--P-R- 76228 thompson 0 26e67cd298 0 0 552 0
26eb492d18 Y--P-R- 76228 thompson 0 26e67cd298 0 0 552 0
26eb492d18 Y--P-R- 76228 thompson 0 26e67cd298 0 0 560 0
26eb492d18 Y--P-R- 76228 thompson 0 26e67cd298 0 0 560 0
26eb492d18 Y--P-R- 76228 thompson 0 26e67cd298 0 0 568 0
26eb492d18 Y--P-R- 76228 thompson 0 26e67cd298 0 0 576 0
26eb492d18 Y--P-R- 76228 thompson 0 26e67cd298 0 0 592 0
26eb492d18 Y--P-R- 76228 thompson 0 26e67cd298 0 0 624 0
26eb492d18 Y--P-R- 76228 thompson 0 26e67cd298 0 0 624 0

The sixth column (wait) shows the address of the condition my session is waiting on. I can use onstat -g con (print conditions with waiters) to see that I’m waiting on the network:

> onstat -g con | grep -E '^cid|26e67cd298'
cid addr name waiter waittime
5789 26e67cd298 netnorm 84353 0

A quick check with onstat -g ses 76228 shows that thread id. 84353 does indeed correspond to my session.

While the wait time shown above is not increasing it’s a different story when we look at netstat, again on the server:

> netstat -nc | grep '172.16.0.1'
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 1312 10.0.0.1:9088 172.16.0.1:37004 ESTABLISHED
tcp 0 1284 10.0.0.1:9088 172.16.0.1:37004 ESTABLISHED
tcp 0 1306 10.0.0.1:9088 172.16.0.1:37004 ESTABLISHED
tcp 0 1302 10.0.0.1:9088 172.16.0.1:37004 ESTABLISHED
tcp 0 1194 10.0.0.1:9088 172.16.0.1:37004 ESTABLISHED
tcp 0 1206 10.0.0.1:9088 172.16.0.1:37004 ESTABLISHED
tcp 0 1266 10.0.0.1:9088 172.16.0.1:37004 ESTABLISHED
tcp 0 1304 10.0.0.1:9088 172.16.0.1:37004 ESTABLISHED
tcp 0 1318 10.0.0.1:9088 172.16.0.1:37004 ESTABLISHED
tcp 0 1248 10.0.0.1:9088 172.16.0.1:37004 ESTABLISHED

What the above is showing us is that there are consistently around 1200 to 1300 bytes in the send queue (Send-Q). This is surely our bottleneck.

At this point when investigating the problem I considered modifying other parameters such as OPTOFC and Linux kernel parameters. However, with a few moments’ thought it was clear these weren’t going to gain anything: OPTOFC optimises the open-fetch-close sequence and for a single long running query this is not going to give us anything measurable; and an investigation into increasing the Linux kernel parameter related to the send queue size was dismissed when we found that 1300 bytes was well below the maximum allowed.

In Informix 11.50 the maximum value of FET_BUF_SIZE is 32767 (32 kB) but this is increased to 2147483648 (2 GB), or as we’ll see actually 2147483647, in 11.70 and above. We can therefore move on to experimenting with different values:

FET_BUF_SIZE                      Query run time (s)  Average Send-Q size over 10 samples (bytes)  Maximum Send-Q size observed (bytes)
Default                           221.2               1274                                         1332
1024                              221.1               1255                                         1326
2048                              221.1               1285                                         1338
4096                              221.2               1297                                         1360
6144                              102.1               2564                                         2676
8192                              56.6                5031                                         5210
16384                             22.6                12490                                        13054
32767 (max. 11.50 value)          11.5                24665                                        29968
65536                             7.0                 62188                                        62612
131072                            4.9                 115793                                       127826
262144                            4.0                 146686                                       237568
524288                            3.5                 184320                                       249856
1048576                           3.3                 245760                                       473616
2097152                           3.2                 249856                                       486352
2147483647 (max. value – 1)       3.0                 245760                                       549352
2147483648 (supposed max. value)  221.3               1276                                         1366

As the run times get shorter it gets tricky to measure the Send-Q using netstat -nc: it can be sampled very frequently using a command like:

while [ 1 ] ; do
netstat -n | grep '172.16.0.1'
done

This will produce many measurements per second and with this it’s possible to see it fill up and drain several times in the period while the statement is running.

It’s also interesting to play around with the boundaries. For example, with a FET_BUF_SIZE of between around 5500 and 5600, maximum Send-Q sizes the same as those consistently achieved with a FET_BUF_SIZE of 6144 begin to creep into the results, but many measurements remain around the values consistently measured with a FET_BUF_SIZE of 4096:

Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 1316 10.0.0.1:9088 172.16.0.1:37488 ESTABLISHED
tcp 0 1318 10.0.0.1:9088 172.16.0.1:37488 ESTABLISHED
tcp 0 1278 10.0.0.1:9088 172.16.0.1:37488 ESTABLISHED
tcp 0 1352 10.0.0.1:9088 172.16.0.1:37488 ESTABLISHED
tcp 0 1288 10.0.0.1:9088 172.16.0.1:37488 ESTABLISHED
tcp 0 2546 10.0.0.1:9088 172.16.0.1:37488 ESTABLISHED
tcp 0 1278 10.0.0.1:9088 172.16.0.1:37488 ESTABLISHED
tcp 0 2502 10.0.0.1:9088 172.16.0.1:37488 ESTABLISHED
tcp 0 1266 10.0.0.1:9088 172.16.0.1:37488 ESTABLISHED
tcp 0 1314 10.0.0.1:9088 172.16.0.1:37488 ESTABLISHED
tcp 0 2506 10.0.0.1:9088 172.16.0.1:37488 ESTABLISHED
tcp 0 1292 10.0.0.1:9088 172.16.0.1:37488 ESTABLISHED

So what are the conclusions?

  • Increasing FET_BUF_SIZE at the client side can dramatically improve the speed of remote queries.
  • Maximum Send-Q sizes, as measured by netstat, increase in discrete steps as FET_BUF_SIZE is increased.
  • A larger Send-Q allows more data to be cached and reduces waits seen in Informix.
  • To see any improvement at all FET_BUF_SIZE must be increased to at least 6000 (approximate value).
  • Around boundaries between maximum Send-Q sizes there appears to be a cross-over region where maximum send queue sizes from two adjacent values overlap and are seen from one second to the next.
  • The maximum value allowed in 11.70 at least is 2147483647 and not 2147483648, as indicated in the documentation.
  • The maximum 11.50 value of 32767 produced a run time nearly 4x slower than an optimised value for 11.70+
  • Other testing I did, not documented here, shows that the results are uniform across JDBC and ESQL/C applications.

Note: all user names, IP addresses and port numbers used in this post have been altered.