Forest of trees

Posted: 7 September, 2018 | Author: Ben Thompson | Filed under: Instance and operating system tuning, SQL query tuning, Storage schema management | Tags: forest of trees, FOT | 2 Comments

Introduction

Forest of trees (FOT) indices were a new feature in 11.70 and are designed to avoid multiple sqlexec threads competing for access to the same index root node (top level of the index) and wasting CPU cycles. When a thread cannot access the root node because it is latched by another thread, it will spin and re-try. This informs the approach taken by the Informix official documentation to finding where a FOT will be beneficial: checking for database objects (that happen to be indices) with the most spin locks.

You can create a FOT like this:
CREATE INDEX fotidx ON tab(c1) hash on (c1) with 100 buckets;

The idea behind FOTs is to split the index into many “buckets”, each with its own root node, improving concurrency by reducing competition between threads. A hash function on a specified column is used to determine in which bucket to place to retrieve or place a row, which in most cases can only practicably be a hash on the leading column of the index. From this it follows that the leading column needs to have a lot of distinct values and not be skewed otherwise the hash function will return the same value frequently negating the potential benefits. Certainly it is pointless to have more buckets than you have distinct values on the hashed column and, unless you want a set of index buckets consisting of a single value, you’ll want the number of buckets to be a fraction of the distinct values since different column values can of course hash to the same bucket.

If you’ve so far ignored this feature do you need it? What can be the benefits of implementing a FOT given there are some potential drawbacks such as not being able to do range scans on the hashed column? What other symptoms are there? And is the implementation as good as it could be?

Do I need one?

As the problem FOTs attempt to solve is competition between threads for access to the same index root nodes, it follows that you are unlikely to ever need one on systems with one or only a few CPU virtual processors (CPU VPs). Without multiple CPU VPs it is impossible for multiple sqlexec threads to be accessing the same buffers at the same time. It also follows that you need a fair number of database connections and for these to be doing similar things. By similar I don’t mean exactly the same because the threads would still compete for the same root node. A good example of where a FOT would work well is where multiple sessions select all the data for a customer by customer id.: the same query is being run in many sessions but the customer id. is different. It’s worth mentioning that this contention occurs purely in memory on the database server; and the performance problem the feature addresses is not related to fetching rows from storage.

What are the benefits?

At the instance level, a large number of spin locks manifests itself as high CPU usage without the database server doing much useful work. Of course there can be many causes of high CPU usage such as poor SQL, poor query plans, slow disk etc. but a feature of systems suffering this kind of contention is that they don’t scale linearly. If an index is not very busy the FOT is not giving you any benefits over a standard B-Tree index but as load ramps up and an index is used more intensively by multiple concurrent threads, spin waits with a standard B-Tree can increase dramatically consuming CPU cycles for no real gain.

What other symptoms are there of spin waits caused by index root node contention?

CPU load suddenly jumping up when load hits a certain level.
A large number of buffer waits (‘onstat -X‘) on the index concerned.
Many running sqlexec threads are all running the same SQL statements. You may see these threads in this state:
tid name rstcb flags curstk status 5362439 sqlexec 1f519b358 ---PR-- 11888 running-

Is the implementation as good as it could be?

There are a couple of missing features in the implementation, both around the creation of the index itself rather than accessing or updating it.:

While you can allocate more memory to the session building the index by setting PDQPRIORITY building an FOT is always a single-threaded operation, meaning it can be slow with large tables.
Rebuilding an index as an FOT via “alter fragment on index” is not supported, meaning you can’t rebuild the index and leave the logical schema intact. Instead you must drop and recreate the index. This isn’t too painful if the index is there for performance reasons only but if it’s supporting constraints you need to drop/recreate these too and, if it’s a primary key, any referencing foreign keys will be implicitly dropped. You can find any foreign keys referencing a given table with:
SELECT t.tabname, c.constrname FROM sysconstraints c, systables t WHERE c.tabid=t.tabid AND c.constrid IN ( SELECT constrid FROM sysreferences WHERE ptabid IN ( SELECT tabid FROM systables WHERE tabname='XXXXXX' ) );
Then use dbschema to fetch the SQL to recreate the foreign keys.

If you think these improvements to the implementation would be beneficial to you, there is an IBM RFE you can vote for.

Summary

This is a feature that you may never need and aimed at solving a specific type of performance problem very effectively.

Extent size doubling

Posted: 13 July, 2018 | Author: Ben Thompson | Filed under: Instance and operating system tuning | Tags: doubling, extent, fextsiz, informix, nextsiz | 3 Comments

This post was inspired a discussion one of the IIUG Technical Special Interest Groups, titled Outrageous Extent sizes.

When you write to an Informix table or index (referred to collectively as an “object”) the Informix engine writes to an extent, a continuous allocation of disk of a fixed size dedicated to that table or index. Informix has a limit on the number of extents an object may have which depends on the page size: around 200 extent for a 2 kB dbspace rising to around 2000 for a 16 kB dbspace.

To ensure space is not wasted new extents start small and then the engine increases the space it allocates as the table grows.

In versions of Informix now unsupported the algorithm was to allocate four extents of the initial extent size and then start using the next extent size. The next extent size doubled every sixteen allocations. Unfortunately this algorithm was slow to “get going” and objects, especially those in 2 kB dbspaces, could quickly gain a lot of extents and approach the extent limit.

To avoid this problem the DBA could modify the first and next extent sizes to suit, which was useful especially if you could anticipate future growth.

In recent versions of Informix the algorithm for allocating extents to tables has changed and become more aggressive, the result being less intervention required by a DBA but perhaps unwanted surprises when large amounts of unallocated space are assigned to objects whenever a new extent is created.

The Informix manual states with regard to extent size doubling:

For permanent tables or user-defined temporary tables, the size of the next extent for every allocation is automatically doubled. The size doubles up to 128 kilobytes (KB). For example, if you create a table with the NEXT SIZE equal to 15 KB, the database server allocates the first extent at a size of 15 KB. The next extent is allocated at 30 KB, and the extent after that is allocated at 60 KB. When the extent size reaches 128 KB, the size is doubled only when the remaining space in the table is less than 10% of the total allocated space in the table.

For system-created temporary tables, the next-extent size begins to double after 4 extents have been added.

I found this explanation kind of confusing and – as we’ll see – it’s slightly wrong anyway.

As ever I am going to use Perl, a DBA favourite, to investigate what is happening. I am going to create a simple table as follows:

create table t1 ( myid bigserial not null, mydata varchar(254) not null ) in my4kspace extent size 16 next size 16 ;

create unique index ui_t1 on t1 (myid) in my4kspace;

Then what I am going to do is insert random data one row at a time into the mydata column and, after every insert, check the next extent size used by referencing the partnum in sysmaster:sysptnhdr. If a new extent is added or the next extent size changes I’m going to print a one-line summary and do this for both the table and its index.

It’s worth mentioning that systnphdr records the actual extent sizes being used, not systables which contains the value last set using SQL or when the table was created.

Here are my results:

TABLE: Next size: 16 kB (4 pages); nptotal: 4 (16 kB); nrows: 1 INDEX: Next size: 16 kB (4 pages); nptotal: 4 (16 kB) TABLE: Next size: 32 kB (8 pages); nptotal: 8 (32 kB); nrows: 98; ratio: 100.00 (16 kB -> 32 kB) TABLE: Next size: 64 kB (16 pages); nptotal: 16 (64 kB); nrows: 207; ratio: 100.00 (32 kB -> 64 kB) TABLE: Next size: 128 kB (32 pages); nptotal: 20 (80 kB); nrows: 424; ratio: 100.00 (64 kB -> 128 kB) INDEX: Next size: 32 kB (8 pages); nptotal: 8 (32 kB); ratio: 100.00 (16 kB -> 32 kB) TABLE: Next size: 256 kB (64 pages); nptotal: 52 (208 kB); nrows: 529; ratio: 160.00 (128 kB -> 256 kB) TABLE: Next size: 512 kB (128 pages); nptotal: 116 (464 kB); nrows: 1402; ratio: 123.08 (256 kB -> 512 kB) INDEX: Next size: 64 kB (16 pages); nptotal: 16 (64 kB); ratio: 100.00 (32 kB -> 64 kB) TABLE: Next size: 512 kB (128 pages); nptotal: 244 (976 kB); nrows: 3157; ratio: 110.34 INDEX: Next size: 128 kB (32 pages); nptotal: 32 (128 kB); ratio: 100.00 (64 kB -> 128 kB) TABLE: Next size: 512 kB (128 pages); nptotal: 372 (1488 kB); nrows: 6636; ratio: 52.46 INDEX: Next size: 256 kB (64 pages); nptotal: 64 (256 kB); ratio: 100.00 (128 kB -> 256 kB) TABLE: Next size: 512 kB (128 pages); nptotal: 500 (2000 kB); nrows: 10162; ratio: 34.41 TABLE: Next size: 512 kB (128 pages); nptotal: 628 (2512 kB); nrows: 13697; ratio: 25.60 INDEX: Next size: 512 kB (128 pages); nptotal: 128 (512 kB); ratio: 100.00 (256 kB -> 512 kB) TABLE: Next size: 512 kB (128 pages); nptotal: 756 (3024 kB); nrows: 17255; ratio: 20.38 TABLE: Next size: 512 kB (128 pages); nptotal: 884 (3536 kB); nrows: 20741; ratio: 16.93 TABLE: Next size: 512 kB (128 pages); nptotal: 1012 (4048 kB); nrows: 24332; ratio: 14.48 TABLE: Next size: 512 kB (128 pages); nptotal: 1140 (4560 kB); nrows: 27883; ratio: 12.65 INDEX: Next size: 512 kB (128 pages); nptotal: 256 (1024 kB); ratio: 100.00 TABLE: Next size: 512 kB (128 pages); nptotal: 1268 (5072 kB); nrows: 31463; ratio: 11.23 TABLE: Next size: 512 kB (128 pages); nptotal: 1396 (5584 kB); nrows: 34921; ratio: 10.09 TABLE: Next size: 1024 kB (256 pages); nptotal: 1524 (6096 kB); nrows: 38471; ratio: 9.17 (512 kB -> 1024 kB) TABLE: Next size: 1024 kB (256 pages); nptotal: 1780 (7120 kB); nrows: 41965; ratio: 16.80 TABLE: Next size: 1024 kB (256 pages); nptotal: 2036 (8144 kB); nrows: 49007; ratio: 14.38 TABLE: Next size: 1024 kB (256 pages); nptotal: 2292 (9168 kB); nrows: 56089; ratio: 12.57 INDEX: Next size: 512 kB (128 pages); nptotal: 384 (1536 kB); ratio: 50.00 TABLE: Next size: 1024 kB (256 pages); nptotal: 2548 (10192 kB); nrows: 63151; ratio: 11.17 TABLE: Next size: 1024 kB (256 pages); nptotal: 2804 (11216 kB); nrows: 70325; ratio: 10.05 TABLE: Next size: 2048 kB (512 pages); nptotal: 3060 (12240 kB); nrows: 77402; ratio: 9.13 (1024 kB -> 2048 kB) TABLE: Next size: 2048 kB (512 pages); nptotal: 3572 (14288 kB); nrows: 84473; ratio: 16.73 INDEX: Next size: 512 kB (128 pages); nptotal: 512 (2048 kB); ratio: 33.33 TABLE: Next size: 2048 kB (512 pages); nptotal: 4084 (16336 kB); nrows: 98674; ratio: 14.33 TABLE: Next size: 2048 kB (512 pages); nptotal: 4596 (18384 kB); nrows: 112742; ratio: 12.54 INDEX: Next size: 512 kB (128 pages); nptotal: 640 (2560 kB); ratio: 25.00 TABLE: Next size: 2048 kB (512 pages); nptotal: 5108 (20432 kB); nrows: 126934; ratio: 11.14 TABLE: Next size: 2048 kB (512 pages); nptotal: 5620 (22480 kB); nrows: 141188; ratio: 10.02 INDEX: Next size: 512 kB (128 pages); nptotal: 768 (3072 kB); ratio: 20.00 TABLE: Next size: 4096 kB (1024 pages); nptotal: 6132 (24528 kB); nrows: 155312; ratio: 9.11 (2048 kB -> 4096 kB) TABLE: Next size: 4096 kB (1024 pages); nptotal: 7156 (28624 kB); nrows: 169379; ratio: 16.70 INDEX: Next size: 512 kB (128 pages); nptotal: 896 (3584 kB); ratio: 16.67 TABLE: Next size: 4096 kB (1024 pages); nptotal: 8180 (32720 kB); nrows: 197862; ratio: 14.31 INDEX: Next size: 512 kB (128 pages); nptotal: 1024 (4096 kB); ratio: 14.29 TABLE: Next size: 4096 kB (1024 pages); nptotal: 9204 (36816 kB); nrows: 226153; ratio: 12.52 INDEX: Next size: 512 kB (128 pages); nptotal: 1152 (4608 kB); ratio: 12.50 TABLE: Next size: 4096 kB (1024 pages); nptotal: 10228 (40912 kB); nrows: 254610; ratio: 11.13 INDEX: Next size: 512 kB (128 pages); nptotal: 1280 (5120 kB); ratio: 11.11 TABLE: Next size: 4096 kB (1024 pages); nptotal: 11252 (45008 kB); nrows: 282889; ratio: 10.01 INDEX: Next size: 512 kB (128 pages); nptotal: 1408 (5632 kB); ratio: 10.00 TABLE: Next size: 8192 kB (2048 pages); nptotal: 12276 (49104 kB); nrows: 311209; ratio: 9.10 (4096 kB -> 8192 kB) INDEX: Next size: 1024 kB (256 pages); nptotal: 1536 (6144 kB); ratio: 9.09 (512 kB -> 1024 kB) TABLE: Next size: 8192 kB (2048 pages); nptotal: 14324 (57296 kB); nrows: 339501; ratio: 16.68 INDEX: Next size: 1024 kB (256 pages); nptotal: 1543 (6172 kB); ratio: 16.67 INDEX: Next size: 1024 kB (256 pages); nptotal: 1799 (7196 kB); ratio: 16.59

What I observe here for this table in a 4 kB space with an index also in a 4 kB space is:

The initial extent is created using the first extent size.
The second extent is created using the next extent size.
The next extent size doubles every time the table needs more space up to 128 pages, not kB as stated in the manual.
The extent size then doubles if the next extent size is less than 10% of the size of the object.
Identical logic applies to both tables and indices.

It’s easy to get confused between kB and pages here which is why I am showing both.

Just to show the 128 pages works for other page sizes, here is the same output using an 8 kB dbspace:

TABLE: Next size: 32 kB (4 pages); nptotal: 4 (32 kB); nrows: 1 INDEX: Next size: 32 kB (4 pages); nptotal: 4 (32 kB) TABLE: Next size: 64 kB (8 pages); nptotal: 8 (64 kB); nrows: 172; ratio: 100.00 (32 kB -> 64 kB) TABLE: Next size: 128 kB (16 pages); nptotal: 16 (128 kB); nrows: 410; ratio: 100.00 (64 kB -> 128 kB) TABLE: Next size: 256 kB (32 pages); nptotal: 32 (256 kB); nrows: 877; ratio: 100.00 (128 kB -> 256 kB) INDEX: Next size: 64 kB (8 pages); nptotal: 8 (64 kB); ratio: 100.00 (32 kB -> 64 kB) TABLE: Next size: 512 kB (64 pages); nptotal: 64 (512 kB); nrows: 1776; ratio: 100.00 (256 kB -> 512 kB) INDEX: Next size: 128 kB (16 pages); nptotal: 16 (128 kB); ratio: 100.00 (64 kB -> 128 kB) TABLE: Next size: 1024 kB (128 pages); nptotal: 128 (1024 kB); nrows: 3570; ratio: 100.00 (512 kB -> 1024 kB) INDEX: Next size: 256 kB (32 pages); nptotal: 32 (256 kB); ratio: 100.00 (128 kB -> 256 kB) TABLE: Next size: 1024 kB (128 pages); nptotal: 256 (2048 kB); nrows: 7214; ratio: 100.00 INDEX: Next size: 512 kB (64 pages); nptotal: 64 (512 kB); ratio: 100.00 (256 kB -> 512 kB) TABLE: Next size: 1024 kB (128 pages); nptotal: 384 (3072 kB); nrows: 14466; ratio: 50.00 TABLE: Next size: 1024 kB (128 pages); nptotal: 512 (4096 kB); nrows: 21756; ratio: 33.33 TABLE: Next size: 1024 kB (128 pages); nptotal: 640 (5120 kB); nrows: 29041; ratio: 25.00 INDEX: Next size: 1024 kB (128 pages); nptotal: 128 (1024 kB); ratio: 100.00 (512 kB -> 1024 kB) TABLE: Next size: 1024 kB (128 pages); nptotal: 768 (6144 kB); nrows: 36175; ratio: 20.00 TABLE: Next size: 1024 kB (128 pages); nptotal: 896 (7168 kB); nrows: 43515; ratio: 16.67 TABLE: Next size: 1024 kB (128 pages); nptotal: 1024 (8192 kB); nrows: 50751; ratio: 14.29 TABLE: Next size: 1024 kB (128 pages); nptotal: 1152 (9216 kB); nrows: 58019; ratio: 12.50 INDEX: Next size: 1024 kB (128 pages); nptotal: 256 (2048 kB); ratio: 100.00 TABLE: Next size: 1024 kB (128 pages); nptotal: 1280 (10240 kB); nrows: 65372; ratio: 11.11 TABLE: Next size: 1024 kB (128 pages); nptotal: 1408 (11264 kB); nrows: 72545; ratio: 10.00 TABLE: Next size: 2048 kB (256 pages); nptotal: 1536 (12288 kB); nrows: 79871; ratio: 9.09 (1024 kB -> 2048 kB) TABLE: Next size: 2048 kB (256 pages); nptotal: 1792 (14336 kB); nrows: 87169; ratio: 16.67 TABLE: Next size: 2048 kB (256 pages); nptotal: 2048 (16384 kB); nrows: 101742; ratio: 14.29 TABLE: Next size: 2048 kB (256 pages); nptotal: 2304 (18432 kB); nrows: 116352; ratio: 12.50 INDEX: Next size: 1024 kB (128 pages); nptotal: 278 (2224 kB); ratio: 50.00 TABLE: Next size: 2048 kB (256 pages); nptotal: 2560 (20480 kB); nrows: 130862; ratio: 11.11 INDEX: Next size: 1024 kB (128 pages); nptotal: 406 (3248 kB); ratio: 46.04 TABLE: Next size: 2048 kB (256 pages); nptotal: 2816 (22528 kB); nrows: 145461; ratio: 10.00 TABLE: Next size: 4096 kB (512 pages); nptotal: 3072 (24576 kB); nrows: 160073; ratio: 9.09 (2048 kB -> 4096 kB) TABLE: Next size: 4096 kB (512 pages); nptotal: 3584 (28672 kB); nrows: 174651; ratio: 16.67 INDEX: Next size: 1024 kB (128 pages); nptotal: 534 (4272 kB); ratio: 31.53 TABLE: Next size: 4096 kB (512 pages); nptotal: 4096 (32768 kB); nrows: 203750; ratio: 14.29 TABLE: Next size: 4096 kB (512 pages); nptotal: 4608 (36864 kB); nrows: 232818; ratio: 12.50 INDEX: Next size: 1024 kB (128 pages); nptotal: 662 (5296 kB); ratio: 23.97 TABLE: Next size: 4096 kB (512 pages); nptotal: 5120 (40960 kB); nrows: 261920; ratio: 11.11 TABLE: Next size: 4096 kB (512 pages); nptotal: 5632 (45056 kB); nrows: 290967; ratio: 10.00 INDEX: Next size: 1024 kB (128 pages); nptotal: 790 (6320 kB); ratio: 19.34 TABLE: Next size: 8192 kB (1024 pages); nptotal: 6144 (49152 kB); nrows: 320100; ratio: 9.09 (4096 kB -> 8192 kB) TABLE: Next size: 8192 kB (1024 pages); nptotal: 7168 (57344 kB); nrows: 349442; ratio: 16.67 INDEX: Next size: 1024 kB (128 pages); nptotal: 918 (7344 kB); ratio: 16.20 TABLE: Next size: 8192 kB (1024 pages); nptotal: 8192 (65536 kB); nrows: 407578; ratio: 14.29 INDEX: Next size: 1024 kB (128 pages); nptotal: 1046 (8368 kB); ratio: 13.94 TABLE: Next size: 8192 kB (1024 pages); nptotal: 9216 (73728 kB); nrows: 465592; ratio: 12.50 INDEX: Next size: 1024 kB (128 pages); nptotal: 1138 (9104 kB); ratio: 12.24 TABLE: Next size: 8192 kB (1024 pages); nptotal: 10240 (81920 kB); nrows: 523746; ratio: 11.11 INDEX: Next size: 1024 kB (128 pages); nptotal: 1266 (10128 kB); ratio: 11.25 TABLE: Next size: 8192 kB (1024 pages); nptotal: 11264 (90112 kB); nrows: 581740; ratio: 10.00 INDEX: Next size: 1024 kB (128 pages); nptotal: 1394 (11152 kB); ratio: 10.11 TABLE: Next size: 16384 kB (2048 pages); nptotal: 12288 (98304 kB); nrows: 639785; ratio: 9.09 (8192 kB -> 16384 kB) INDEX: Next size: 2048 kB (256 pages); nptotal: 1522 (12176 kB); ratio: 9.18 (1024 kB -> 2048 kB) TABLE: Next size: 16384 kB (2048 pages); nptotal: 14336 (114688 kB); nrows: 697932; ratio: 16.67

Another feature of the engine is merging adjacent extents for the same object. This excerpt from an oncheck -pe report is from the 8 kB example:

Description Offset(p) Size(p) ------------------------------------------------------------- -------- -------- dbname:'benthompson'.t1 187897 14336 dbname:'benthompson'.ui_t1 221782 278 dbname:'benthompson'.ui_t1 223004 860 dbname:'benthompson'.ui_t1 234148 384

In my example the engine has managed to merge all the extents for the table into one and has merged the index extents into three (278 + 860 + 384 = 1522).

If you want to try out the test yourself the code is in the skybet/informix-helpers github repository

Monitoring Informix with Grafana

Posted: 3 January, 2018 | Author: Ben Thompson | Filed under: Instance and operating system tuning, Monitoring | Tags: Grafana, Graphite, InfluxDB, informix | 6 Comments

Introduction

In a presentation I gave at IIUG 2017 titled Making system monitoring better I showed, without much detail, how Grafana is a powerful tool for visualising what is happening within your Informix server. Performance metrics from your database server are collected at regular (usually 10 second) intervals and stored in a time-series database which can be used as the source for dashboards containing dynamic graphs and other ways of presenting the data. For the DBA the benefits are legion:

Quickly change the time range to zoom into when problems occurred or zoom out to see trends.
Correlate various database metrics and combine then with related operating system, network, storage or application metrics.
Get a truer picture of your busy periods, capacity, the effect of scheduled jobs etc.
Faster problem resolution because a lot of data can be visualised at once.

You might be able to think of some others.

The talk also touched on the CAMS acronym:

Culture
Automation
Measurement
Sharing

So you shouldn’t keep your dashboards to yourself: share them with other technical teams or everyone in your company. This has the added benefit of more eyes and others can learn to spot database problems, or when they are probably not database problems, by referring to these.

Why Grafana?

There are a number of tools which appear to do a similar job:

You perhaps haven’t heard of Brunia: it is the code name for a prototype monitoring tool that may replace Informix Open Admin Tool (OAT) in the future. It was demonstrated at IIUG 2017 and is probably closest to Prometheus in its execution. AGS Sentinel is the monitoring add-on to the popular ServerStudio suite for Informix. The rest are popular open source tools which other teams in your organisation are probably already using.

Some of the tools listed above can also produce events or alerts when a trigger condition occurs and automatically pass this up a stack to PagerDuty or another call-out system. An example of such a stack is Prometheus -> Alertmanager -> PagerDuty -> StatusPage

There are a lot of ways of implementing a full monitoring stack with choices to make about data collection, storing, visualisation, analysis and alerting. In this blog post I am going to concentrate on a simple set up where we collect Informix metrics, pass them to InfluxDB using a REST API and visualise in Grafana. For a fuller discussion of the benefits of the three open source technologies mentioned above I highly recommend reading this blog post from Loom Systems written in June 2017, Prometheus vs. Grafana vs. Graphite – A Feature Comparison.

In case you’re not going to read the LS blog it’s worth emphasising what the InfluxDB/Grafana solution I am about to describe does not provide:

There is little in the way of monitoring and alerting features.
Regression models enabling you to predict the value of a metric in the future are not available.
Advanced time series functions are not available.

The solution would be richer if Graphite was used as the data source and Grafana for visualisation only. This would provide more aggregation functions and allows you to do things like subtract one time series from another. As an example of what this might provide, I have a dashboard (not covered in this blog post) displaying the buffer turnover ratio and buffer waits ratio over an arbitrary moving window irrespective of when onstat -z was last run.

It is easy to confuse Graphite and Grafana, especially as both can be used independently, or Graphite can be a data source for Grafana.

As this is an Informix blog I ought to explain why I am using InfluxDB and not Informix time series? The simple answer is that to use Informix time series with Grafana properly someone would have to write and maintain a data source plugin for it like the one for InfluxDB we’ll be using. Doing so would give something more feature rich than InfluxDB for sure but perhaps not much more powerful than a Graphite/Grafana combination.

What to monitor

Potentially anything we can put a value to every ten seconds can be collected and stored in InfluxDB (which is a statement you can make about time series collections in general). For Linux operating system metrics there is a well-established collection daemon called collectd and, if I had better C programming skills, I could a collectd plugin for Informix.

For Informix systems the most obvious source is the system monitoring interface (SMI) which is the presentation of information held in shared memory through pseudo-tables in the sysmaster database. This covers the vast majority of what can be collected using onstat but is easier to handle in a programming language. Doing it this way means we can also collect real table data in the same manner.

For example the system profile or onstat -p can be captured with the following SQL:

SELECT TRIM(name) AS name, value FROM sysmaster:sysprofile WHERE name IN ('dskreads', 'bufreads', 'dskwrites', 'bufwrites', 'isamtot', 'isopens', 'isstarts', 'isreads', 'iswrites', 'isrewrites', 'isdeletes', 'iscommits', 'isrollbacks', 'latchwts', 'buffwts', 'lockreqs', 'lockwts', 'ckptwts', 'deadlks', 'lktouts', 'numckpts', 'plgpagewrites', 'plgwrites', 'llgrecs', 'llgpagewrites', 'llgwrites', 'pagreads', 'pagwrites', 'flushes', 'compress', 'fgwrites', 'lruwrites', 'chunkwrites', 'btraidx', 'dpra', 'rapgs_used', 'seqscans', 'totalsorts', 'memsorts', 'disksorts', 'maxsortspace')

It’s important to understand that all of these metrics are what I’d term counters. That is they only increase over time (unless they get so large they run out of bits and wrap or a DBA runs onstat -z). It gets difficult to see on a graph the difference between, say, 2394472 and 2394483 and so it’s useful to calculate a delta over the ten second window. Some things you might collect are automatically suitable for graphing because they are gauges: an example of this is the number of threads in your ready queue at any given moment.

Implementation

Practical demonstration with Docker containers

Nothing better than an example you can try at home (or work!). In the implementation example I will be using the IBM Informix Developer Edition Docker container which, at time of writing, runs Debian 8 (jeesie) and a second Docker container for InfluxDB and Grafana. You’ll of course need Docker installed on your laptop or workstation for this to work.

What this demonstration is going to build will look like the above. A collector script will collect metrics from Informix at a regular interval and post the results to InfluxDB. You will be able to use your usual web browser to connect to Grafana and visualise the data. Sounds simple?

We’ll start by setting up the InfluxDB/Grafana Docker container which will be also be using on a (minimal) Debian installation. In a terminal run:

docker pull debian docker run -it --name influx_grafana_monitoring -p 8086:8086 -p 3000:3000 --hostname grafserv -e "GF_SECURITY_ADMIN_PASSWORD=secret" debian

Your terminal should now be inside the Docker container and logged in as root. Run these commands to install some extra packages and then InfluxDB:

apt-get update apt-get -y install curl gnupg apt-transport-https procps curl -sL https://repos.influxdata.com/influxdb.key | apt-key add - echo "deb https://repos.influxdata.com/debian jessie stable" | tee -a /etc/apt/sources.list apt-get update apt-get -y install influxdb

Next install Grafana in the container:

echo "deb https://packagecloud.io/grafana/stable/debian/ jessie main" | tee -a /etc/apt/sources.list curl https://packagecloud.io/gpg.key | apt-key add - apt-get update apt-get -y install grafana

Start the both services inside the container:

/etc/init.d/influxdb start /etc/init.d/grafana-server start

We need to create an Influx database to store our time series data and we can do this with a REST API call:

curl -i -XPOST http://localhost:8086/query --data-urlencode "q=CREATE DATABASE informix"

If it works you should see a HTTP/1.1 200 OK response.

You should now be able to access the Grafana server running in your Docker container from your usual web browser at http://localhost:3000/

Log in with the user name admin and the password secret. Once logged in click Add data source and fill in the settings as follows (some of them are case-sensitive):

Name: informix
Type: InfluxDB

HTTP settings

URL: http://localhost:8086
Access: direct

HTTP Auth

Basic auth: Leave unticked
With credentials: Leave unticked

InfluxDB Details

Database: informix
User: Leave blank
Password: Leave blank
Min time interval: Leave at 10s

All being well you should see Data source is working in a big green box.

Now we are going to set up the Informix container to monitor. On your workstation in another terminal run:

$ docker pull ibmcom/informix-developer-database docker run -it --name iif_developer_edition --privileged -p 9088:9088 -p 9089:9089 -p 27017:27017 -p 27018:27018 -p 27883:27883 --hostname ifxserver -e LICENSE=accept ibmcom/informix-developer-database:latest

The command above should provide a running Informix instance which may take a few moments after which control is passed back to the terminal. We are now going to get the scripts that will send data to InfluxDB:

sudo -i apt-get update apt-get -y install git libdbi-perl libjson-perl libwww-curl-perl make gcc libtest-pod-perl

We need to get the Perl DBD::Informix package from CPAN which will download, compile, test and install it for us.

. /home/informix/ifx_dev.env export DBI_DBNAME=sysmaster export DBD_INFORMIX_DATABASE=sysmaster export DBD_INFORMIX_USERNAME=informix export DBD_INFORMIX_PASSWORD=in4mix cpan

Enter ‘yes’ to configure as much as possible. In the CPAN console type the case-sensitive command:

install DBD::Informix

There is quite a lot that could go wrong in the CPAN console but it should work if you’re using the IBM Informix DE Docker container and follow the steps exactly. If you’re installing on RedHat Linux or a derivative the required RPM package names you use with yum install will all be different.

Type logout to exit the root shell. You should be logged in as user informix again. Leave this session for later.

Run on your local workstation (outside both Docker containers) in another terminal window:

git clone https://github.com/skybet/informix-helpers.git

This assumes you have git installed. There are two important files in the post_to_influxdb directory:

informix_smi_influx_uploader
informix_smi.json

You will need to edit informix_smi.json and change all references from mydatabase to the name of the user/application database you want to monitor. For the purposes of the blog post in this demo, we are just going to monitor the sysmaster database so change all mydatabase references to sysmaster.

You can copy the files to your Informix Docker container as follows. To get the name of your Informix Docker container (not its hostname) simply type docker ps on your workstation.

$ docker ps CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES 5a9c73712429 debian "bash" 8 minutes ago Up 8 minutes 0.0.0.0:3000->3000/tcp, 0.0.0.0:8086->8086/tcp influx_grafana_monitoring e1c178b57164 ibmcom/informix-developer-database:latest "/bin/bash informi..." 13 minutes ago Up 13 minutes 0.0.0.0:9088-9089->9088-9089/tcp, 0.0.0.0:27017-27018->27017-27018/tcp, 0.0.0.0:27883->27883/tcp iif_developer_edition

From the above my Informix container name is e1c178b57164

docker cp informix_smi_influx_uploader e1c178b57164:/home/informix docker cp informix_smi.json e1c178b57164:/home/informix

We should be ready to start collecting metrics and posting them to InfluxDB. Run in the Informix container as user informix:

cd ./informix_smi_influx_uploader -c ./informix_smi.json -i 10 -u http://other_container_ip:8086/write?db=informix

Change other_container_ip to the IP address of your InfluxDB/Grafana container. You must use the IP address unless you have name resolution which this basic Docker set up does not. If you don’t know what this is you can ping the docker container name from inside the InfluxDB/Grafana container using a command like ping -c 1 grafserv

All being well the Perl script should run continuously and collect and post data to InfluxDB every 10 seconds or whatever interval you specified with the -i switch.

To see anything in Grafana we’ll need to set up a dashboard. The file informix_server.json in the grafana_dashboard directory describes a Grafana dashboard. You’ll need to edit it a bit first and change all occurrences of the following:

<%= @hostname %>: hostname of your Informix docker container, normally the unqualified hostname of your Informix server
<%= @informixserver %>: Name of your Informix instance, dev

In the Grafana web browser click the Grafana logo and then Dashboards. Click Import. At Import Dashboard click Upload .json File. Hey presto, you should have a dashboard with graphs. Some may not display any data, e.g. Temporary dbspace usage, because there are no temporary dbspaces in the development Docker image by default.

Making this ready for production

There are a few bits I’ll leave to you for any production implementation:

The collection shouldn’t run as user informix. Create a user for the monitoring and give it just the CONNECT and SELECT privileges it needs.
You’ll also need to write a script to start/stop the collection with the instance.
Linux operating system statistics, gathered through collectd would complement the dashboard very nicely.
You’ll probably want to customise both the JSON file containing the SMI queries and the one describing the dashboard. You could add application metrics than can be collected from the database or create multiple dashboards especially if you don’t like the idea of one big one showing everything. Bear in mind any queries you write need to be fast and will run every 10 seconds.
In the dashboard you may need to add/remove different page sizes to/from the buffer, page and disk reads/writes graphs.

Conclusion

It can be very useful to track virtual segment and temp. space usage on your server and correlate with events like update stats, ontape/onbar backups or application activity. You can use other tools to do this but these often are not as accessible or are purely in the realm of the DBA. A Grafana dashboard like the one described here should be very useful for you and colleagues, especially if they have their own dashboards on the same system which allow you to view your systems as a whole, and it might go some distance to demystifying Informix in your organisation.

Passwords encrypted over the network: why is this feature not enabled by default?

Posted: 26 June, 2017 | Author: Ben Thompson | Filed under: Compliance, Instance and operating system tuning | Leave a comment

In 2015 I wrote a blog post about using simple password encryption (SPW) and how – without it – your valuable passwords can be trivially sniffed on your network. If you look through the post it illustrates the vulnerability and shows just how easy it is to set your system up in a more secure way.

SPW only encrypts your password when you connect. Not everyone wants or needs full encryption of all their traffic but what reasons are there not to use SPW?

It requires a small amount of set up extra work, although this can (should?) be automated.
It means your database engine spawns some extra cssmbox_cn threads, although they are only used at connection time and the overhead is low.
Consideration should be given to patching the IBM Global Security Kit (GSKit) separately from the server and client, both of which bundle it.

I don’t know of any other drawbacks. In my opinion these are nothing substantive then when you consider your peace of mind.

If you have Fix Central access you can always download the latest GSKit from here. Although it’s used by many IBM products it’s filed under Tivoli which isn’t obvious at all.

Patching the GSKit separately isn’t necessarily something you need to do but it isn’t only used by SPW: if you’ve set ENCRYPT_HDR, ENCRYPT_SMX or ENCRYPT_CDR, for example, you are using it. The GSKit doesn’t get installed in INFORMIXDIR; it’s installed by RPM (on Linux) to /usr/local/ibm and only one version can exist on your server. So if you’re used to pre-installing a new version of Informix server or Client SDK in its own folder prior to an upgrade, be aware that you may just have unwittingly upgraded the GSKit.

The feature has suffered a few issues lately and is currently broken when used with the Informix JDBC driver in 11.70.xC9; connections supported by CSDK or IConnect work fine. I think the feature would be more dependable if more people used it (or if the product testing stress tested this area). Here are some relatively recent issues:

All recent JDBC drivers earlier than 4.10.JC8 (including 3.70.JC8W1) suffer from an issue where a small proportion of connections will fail. You might not notice this if your application can capture logon failures and retry automatically. There is no APAR for this that I know of as 4.10.JC8 was extensively reworked for JDBC 4.0 support.
Informix 11.70.xC9 contains fix IT10493 but this caused a high rate of logon failures with SPW and fix IT17087 is additionally needed but not included.
If you’re using the 12.10 code line you need xC8 or later to get the same fix.
CSDK 4.10.FC8 ships with an incompatible GSKit version, 8.0.50.66, but actually requires 8.0.50.70+ (APAR IT18763). You may not notice this, however, if your server software ships with a later version.

I hope this doesn’t come across as a moan, more a call to action.

When do my stored procedure execution plans get updated?

Posted: 23 June, 2017 | Author: Ben Thompson | Filed under: Instance and operating system tuning, Monitoring, SQL and SPL debugging, SQL query tuning | Leave a comment

For the sake of brevity in this article I am going to group procedures, functions and routines together as stored procedures and ignore any differences between them.

What does the SQL command UPDATE STATISTICS FOR PROCEDURE/FUNCTION/ROUTINE does and perhaps, more pertinently, as a DBA do I need to run this regularly to ensure my systems are working efficiently? For those wanting an immediate answer I think it is “never” or “almost never“, the reasons for which I hope to explain clearly in this article.

The command itself is straightforward: calling it causes Informix to parse a stored procedure’s code and produce a query plan for all of it based on the current statistics and data distributions (if the procedure references any tables). It then writes the query plan to the sysprocplan table which is, unless you have an unlogged database, a logged operation written to the logical log. Used with no parameters it does this for all stored procedures in the system.

As long as the query plan in sysprocplan is reasonably efficient there is probably no need to (ever) proactively update it but there may be cases when you’d want to do so, for example, if a very small or empty table has grown into a large one. However if you were to do this your new plan would be based on the current table statistics and data distributions and if these haven’t been updated yet you may get the same, now inefficient, plan.

The manual states:

The sysprocplan system catalog table stores execution plans for SPL routines. Two actions can update the sysprocplan system catalog table:

Execution of an SPL routine that uses a modified table

The UPDATE STATISTICS FOR ROUTINE, FUNCTION, or PROCEDURE statement.

There is a created column in the sysprocplan table but it’s a date and not a date/time which makes it much harder to match plan updates to other events.

So what is a modified table? Quite simply it is one where the version number has been incremented. You can see the version number with an SQL query like:

select version from systables where owner='myowner' and tabname='mytable';

I think the only reference to this in manual is in section about the systables view where it simply says:

version
INTEGER
Number that changes when table is altered

How the engine works out the dependencies a stored procedure has on different tables falls into the category of system internals, which IBM chooses not to publicly document, but I think it’s safe to say that if a table is referenced anywhere in a procedure it is dependent on it.

There are many ways a table can be “altered”, some more obvious than others:

Method	Version number incremented by
GRANT SELECT	1
GRANT UPDATE	1
GRANT SELECT	1
GRANT SELECT, UPDATE	1
UPDATE STATISTICS HIGH	2
UPDATE STATISTICS MEDIUM	2
UPDATE STATISTICS [LOW]	2
RENAME COLUMN	65536
CREATE INDEX	65536
ADD column	131072
DROP column	131072
GRANT CONNECT	0
GRANT RESOURCE	0
GRANT DBA	1

I am not sure why some operations increment the value by large numbers, all powers of 2, as any increment has a similar effect, at least as far as the scope of this article is concerned.

The table is not a complete list because there are many possible DDL operations but this does already illustrate or suggest that:

On most systems it’s likely that UPDATE STATISTICS commands will be the main trigger for stored query plans to be updated. If you run LOW, HIGH and MEDIUM modes for a table like you will if you use AUS or dostats, you’ll trigger at least three updates for dependent stored procedures (if they are called).
If we want to grant multiple privileges on the same table, it’s best to do it in a single statement because if a dependent stored procedure is being called in between running commands by an application, its stored execution plan will be updated only once.
GRANT DBA is not a table level operation yet it has an effect.

Further testing shows that both the GRANT DBA and REVOKE DBA statements increment the version number on all tables in the same database where the tabid is 100 or greater, that is all user tables. From the manual above it follows that the stored query plans for all stored procedures or functions dependent on a table will be updated the next time they are executed.

On our systems we see a large amount of writes to sysprocplan after granting or revoking the DBA privilege to anyone. When graphed we see a sharp peak and a long exponential tail off as less commonly used procedures get called.

Therefore if you grant DBA to a user on a busy live system, it can affect concurrency more than you might expect. On an idle system you may want to run UPDATE STATISTICS FOR PROCEDURE immediately afterwards to update the stored query plans in an orderly way and save the first session to call any given procedure from this overhead.

I think running the command offline to avoid the overhead for user or application sessions is possibly the only true use case for this command.

Buffer waits

Posted: 21 November, 2016 | Author: Ben Thompson | Filed under: Instance and operating system tuning, Monitoring | 3 Comments

Is it really a year since I last wrote a blog post? It does mean I have had time to think of a few topics to write about.

Recently a desk visitor came to me at work about a performance issue. From what he said I got the impression that he thought that DBAs spend most of their time tuning SQL queries, something I spent little time on during a typical day, perhaps one of the advantages of working on a mature system.

Perhaps though he was kind of right. Many of the things I do are around making sure queries run reliably, consistently and in a scalable manner, I just don’t necessarily do this by looking at explain plans. Don’t get me wrong: these are very important; it’s just that once these are as good as they can be it doesn’t mean there aren’t other ways of finding bottlenecks or contention and tuning them out. For example when I was a more junior DBA I used to concern myself with buffer cache hits and, once I knew a little more, buffer turnover ratios, reducing I/O by allocating more memory as Moore’s Law provided rapid improvements in server CPU speed and memory size.

In the Oracle world DBAs have moved away from this measure and use the Oracle wait interface which allows you to see what operations the engine was doing. We can do this (slightly differently) in Informix too and it’s very useful. The simplest overview of waits is to look at position one of the flags column from “onstat -u”. The Informix documentation states:

flags
Provides the status of the session.
The flag codes for position 1:
B Waiting for a buffer
C Waiting for a checkpoint
G Waiting for a write of the logical-log buffer
L Waiting for a lock
S Waiting for mutex
T Waiting for a transaction
Y Waiting for condition
X Waiting for a transaction cleanup (rollback)

Most sessions will probably show Y which usually means they are waiting on TCP transit or are idle, waiting for the application or user’s session to do something.

As the manual says we can get more detail for buffers through onstat -b, -B and -X; for latches (mutexes) through onstat -s, (also -g lmx and -g wmx) and locks with onstat -k.

onstat -X is the most useful way to examine buffer waits but it is not as user-friendly as it could be, which is a shame because it often offers many clues to where problems lie:

Buffers (Access) address owner flags pagenum memaddr nslots pgflgs scount waiter


Buffer pool page size: 2048

 14700 modified, 16777216 total, 16777216 hash buckets, 2048 buffer size
Buffer pool page size: 4096

2443cd7e8        ffffffffffffffff 80    25:1523486       cce17b000        101    2801   0        0

267ca4628        0                0     47:1570054       105c3c5000       122    2890   1        0

 53980 modified, 8388608 total, 8388608 hash buckets, 4096 buffer size
Buffer pool page size: 8192

 59577 modified, 8388608 total, 8388608 hash buckets, 8192 buffer size

Buffer pool page size: 16384 3784a8188 ffffffffffffffff 80 162:18904760 4baadf4000 248 890 0 0 37854d188 ffffffffffffffff 80 162:24581408 4baeff4000 248 890 0 0 378ead5e8 ffffffffffffffff 80 124:25597240 4beb010000 187 2801 0 0 378f781a8 ffffffffffffffff 80 124:25597376 4bf0128000 710 890 0 0 3798d3388 ffffffffffffffff 80 124:25597176 4c2bf34000 710 890 0 595236d428 3799321a8 ffffffffffffffff 80 162:13196672 4c2e528000 248 890 0 624d39d668 37a353128 ffffffffffffffff 80 124:25597840 4c6f258000 197 801 0 0 37a4cefe8 ffffffffffffffff 80 168:32233760 4c78a50000 399 890 0 0 37c485d28 ffffffffffffffff 80 264:13942672 4d439d8000 319 890 0 0 37c5b45c8 ffffffffffffffff 80 162:24158848 4d4b2dc000 193 2801 0 0 37c80f368 ffffffffffffffff 80 168:33303832 4d5a400000 303 890 0 0 37caf6ce8 0 10 124:25597160 4d6cd70000 710 890 1 0 37ceaab28 ffffffffffffffff 80 166:8227296 4d84898000 332 890 0 0 37ceba8e8 ffffffffffffffff 80 124:25597648 4d84ef0000 710 890 0 0 37d70f4a8 ffffffffffffffff 80 124:25597208 4dba408000 193 801 0 0 37d891088 ffffffffffffffff 80 162:26376432 4dc3e54000 248 890 0 0 37dc9abe8 58cc3c7eb8 80 144:18435488 4dddbd0000 193 2801 0 0 87962 modified, 13762560 total, 16777216 hash buckets, 16384 buffer size

The key problem here from a usability point of view is that it is tedious to convert the chunk:pagenum format into an actual database object.

There is a similar problem with part numbers when deciphering output from, for example onstat -k that can be solved by downloading and compiling the ESQL/C utility partn from the IIUG software repository.

Loosely based on that here is my Perl script, chunkpg, which can provide friendly names for objects in chunk:pagenum format:

#!/usr/bin/perl


# Decipher chunk and page numbers
use strict;

use warnings;
&main;
sub main {
	my $key = &check_params;
	my $chunkno = 0;

	my $nextinfo = 0;

	my $pagesize = 0;

	my $count = 0;

	my $lastdbs = '';

	my $syspagesize = 2;  # change to 4 kb if required
	my (%objs, %start, %end, %c, %ps, %dbs);
	open (ONCHECKPE, "oncheck -pe |");

	while () {

		chomp;

		if ($nextinfo == 1) {

			$nextinfo = 0;

			my ($blank, $path, $size, $used, $free);

			($blank, $chunkno, $path, $pagesize, $size, $used, $free) = split / +/;

			# print "$chunkno: $pagesize kb\n";

			$count = 0;

			$c{$chunkno} = 0;

			$ps{$chunkno} = $pagesize;

		}

		elsif ($_ eq ' Chunk Pathname                             Pagesize(k)  Size(p)  Used(p)  Free(p)') {

			$nextinfo = 1;

		}

		elsif ($_ =~ /^DBspace Usage Report: /) {

			my @parts = split / +/;

			$lastdbs = $parts[3];

			$dbs{$chunkno} = $lastdbs;

		}

		elsif ($_ =~ /^ \w+:\'\w+\'\.\w+ +/) {

			$count++;

			my ($blank, $obj, $offset, $size) = split / +/;

			# printf ("%s: %d -> %d\n", $obj, $offset*2, $offset*2+$size*2);

			$objs{$chunkno}{$count} = $obj;

			$start{$chunkno}{$count} = $offset;

			$end{$chunkno}{$count} = $size + $offset;

			$c{$chunkno}++;

			if (!$dbs{$chunkno}) {

				$dbs{$chunkno} = $lastdbs;

			}

		}

	}

	close (ONCHECKPE);
	while () {

		chomp;

		print "$_ ";

		$_ =~ s/^ +//;

		my @vals = split / +/;

		if ($vals[$key-1] && $vals[$key-1] =~ /\d+\:\d+/) {

			my ($chunk, $page) = split /\:/, $vals[$key-1];

			for (my $i = 1; $i = $start{$chunk}{$i} && $end{$chunk}{$i} && $page/($ps{$chunk}/$syspagesize) <= $end{$chunk}{$i}) {

					print $objs{$chunk}{$i}.','.$dbs{$chunk};

					last;

				}

			}

		}

		print "\n";

	}

}
sub check_params () {
	die print "INFORMIXDIR is not set.\n" unless $ENV{'INFORMIXDIR'};

	die print "INFORMIXDIR is not set to a valid directory (\'$ENV{'INFORMIXDIR'}\').\n" unless -d $ENV{'INFORMIXDIR'};

	die print "INFORMIXSERVER is not set.\n" unless $ENV{'INFORMIXSERVER'};

if ($ARGV[0] && $ARGV[0] eq '-k') { if ($ARGV[1] && $ARGV[1] !~ /\D/) { return ($ARGV[1]); } else { print "Invalid key number: $ARGV[1]\n"; exit 1; } } elsif ($ARGV[0]) { print "Invalid parameter: $ARGV[1]\n"; exit 1; } return (1); }

You can then run as follows: onstat -X | chunkpg -k 4

If you were to run onstat -X repeatedly, perhaps at different times of day, you would begin to build a picture of where buffer waits are occurring.

Having identified buffer waits, what can be done to reduce them?

Ultimately it is going to come back to your database design, including its storage schema, the queries you run and maybe onconfig parameters.

Buffer waits on tables can be reduced using partitioning. Fragmentation by round-robin is effective for improving the rate of concurrent inserts (although it doesn’t facilitate fragment elimination in queries). Very large tables may require some form of partitioning to avoid reaching the 16.7m (2^24) page limit but even smaller tables with a large number of inserts and/or deletes can benefit.

It’s worth noting here that buffer waits occur in server RAM and so it’s not obvious at all that the storage schema should come into it. After it only directly affects what happens on disk, right? Not so: I can show that different storage schemas perform differently by running a concurrent insert test on a machine tuned to only flush data to disk at checkpoints. If a table has 10-way round-robin partitioning then it has ten different part numbers and is for many purposes ten different tables. For inserts we are always appending to the end of the table and so there will be contention on the last buffered page between sessions. Spreading this across ten different partitions reduces it.

Indices can also see a large number of buffer waits. In one stark example I found a large number of buffer waits were occurring on a large single-column index where every value in the column was null (nulls are indexed in Informix, unlike some other RDBMSs) and so there was a single leaf node pointing to all rows. Every insert and delete needed to modify this. A less extreme version of this might be seen where there are a limited number of values a field can take. In this case dropping the index or combining the index column with another to vastly increase the number of nodes would help throughput.

Would index fragmentation/partitioning help? It is probably less useful, in general because there is no 16.7m page limit for indices since version 11.70 and, in the case of fragmentation by expression, the same nodes could be equally congested.

In conclusion I think this method is a useful addition to your performance tuning armoury and by reducing contention you can improve the scalability of your system and increase throughput.

Zone reclaim mode

Posted: 16 October, 2015 | Author: Ben Thompson | Filed under: Instance and operating system tuning | Tags: NUMA, zone_reclaim_mode | 2 Comments

Non-uniform memory access or NUMA is not a new concept but high end multiprocessor Intel-based servers are increasingly configured with this architecture, bringing it more to the mainstream. Put simply NUMA means that instead of all processors accessing your main system memory through a common bus, each processor is allocated an even share of the memory that it can address directly. If a processor needs to access memory controlled by another processor it can do so through that other processor.

Linux kernels from v2.5 onwards are aware of any NUMA architecture and it can be displayed using numactl -H or numactl –hardware:

node distances: node 0 1 2 3 0: 10 21 21 21 1: 21 10 21 21 2: 21 21 10 21 3: 21 21 21 10

The above is from a four socket server. It shows that fetching from local memory is weighted at ’10’ and from memory controlled by other processors ’21’. I strongly suspect these weightings are hard coded.

numactl -H also shows information about how the memory is split between processors. The term ‘node’ is used:

available: 4 nodes (0-3) node 0 cpus: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39 node 0 size: 65418 MB node 0 free: 310 MB node 1 cpus: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47 node 1 size: 65536 MB node 1 free: 41 MB node 2 cpus: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55 node 2 size: 65536 MB node 2 free: 82 MB node 3 cpus: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63 node 3 size: 65536 MB node 3 free: 43 MB

What the above shows is that the free memory available to each node varies. If a process running on node 3, in our example, needs to allocate memory and it needs more than 43 Mb, it can either:

Use memory assigned to another node, for example node 0. This means the memory access will not be local.
Reclaim memory from node 3’s local memory by evicting other pages from memory.

The kernel switch vm.zone_reclaim_mode controls which behaviour is used. If set to 1 it will prefer to evict other pages from memory.

This is explained in a great more detail in this article by Christoph Lameter.

How is this parameter set on your system? You can check by running cat /proc/sys/vm/zone_reclaim_mode

If it’s set to 1 on your Informix system you should definitely read on. You’ll be glad to hear this parameter can be changed dynamically.

In the latest kernels (2014 onwards) this commit means that the parameter will never be set on your system automatically but if you’re running an enterprise Linux you could be on a kernel version like 2.6.32 (RHEL 6) where this can occur: although patched the base version of this dates from 2009.

I am not sure of the exact criteria that determine when older Linux kernels will switch on this feature at boot up. I think you need a modern four (or more) processor server with a NUMA architecture but there may be other requirements.

It’s interesting to read the slightly repetitious kernel commit log:

When it was introduced, zone_reclaim_mode made sense as NUMA distances punished and workloads were generally partitioned to fit into a NUMA node. NUMA machines are now common but few of the workloads are NUMA-aware and it’s routine to see major performance degradation due to zone_reclaim_mode being enabled but relatively few can identify the problem.

Those that require zone_reclaim_mode are likely to be able to detect when it needs to be enabled and tune appropriately so lets have a sensible default for the bulk of users.

This patch (of 2):

zone_reclaim_mode causes processes to prefer reclaiming memory from local node instead of spilling over to other nodes. This made sense initially when NUMA machines were almost exclusively HPC and the workload was partitioned into nodes. The NUMA penalties were sufficiently high to justify reclaiming the memory. On current machines and workloads it is often the case that zone_reclaim_mode destroys performance but not all users know how to detect this. Favour the common case and disable it by default. Users that are sophisticated enough to know they need zone_reclaim_mode will detect it.

Hopefully now the relevance to Informix is becoming a little clearer. Certainly there has been much complaining in the PostgreSQL community about this parameter. Another frustrated blog post describes some of the massive I/O latency problems it can cause on your system even when under no obvious memory pressure.

On our Informix system, which uses huge pages, we have experienced long disruptive checkpoints as a result of zone reclaiming. As huge pages are not swappable, it’s likely to be our monitoring and other non-Informix processes provoking the zone reclaims.

The long checkpoint behaviour can be summarised as:

A checkpoint is triggered by CKPTINTVL.
Informix instructs all threads to finish what they are doing and goes into state CKPT REQ.
One or more threads may be in critical section and must continue to the end of this section before it can stop.
A zone reclaim is occurring and I/O throughput dramatically decreases and this thread takes many seconds to come out of critical section.
All active threads wait (state C in the first column of onstat -u).
Eventually the operation completes, the checkpoint actually occurs very quickly and processing continues.

This behaviour can occur in later versions of the engine with non-blocking checkpoints.

If you have the mon_checkpoint sysadmin task enabled (I strongly recommend this), information about your checkpoints will be written to sysadmin:mon_checkpoint. (Otherwise you only retain information about the last twenty checkpoints visible through onstat -g ckp.) A tell tale sign is a large crit_time, nearly all of the checkpoint duration, and a much smaller flush_time.

You can get further evidence of whether a zone reclaim might be occurring at the same time by looking at the number of pages scanned per second in the output from sar -B. (sar is a very sophisticated monitoring tool these days with views into many aspects of the operating system.)

One test you can try (on a test server) is LinkedIn Engineering’s GraphDB simulator. It’s a C++ program that mimics the behaviour of GraphDB and is designed to provoke zone reclaim behaviour from the Linux kernel if it is switched on.

On our test system we can leave it running for hours without zone reclaim enabled and monitor it through sar -B.

10:30:55 AM pgpgin/s pgpgout/s 10:31:00 AM 951.42 20993.52 8415.59 10:31:05 AM 294.97 20930.38 8764.59 10:31:10 AM 170.28 24627.31 4939.16 10:31:15 AM 193.12 77519.03 5379.96 10:31:20 AM 240.24 88966.60 6875.45 10:31:25 AM 183.50 277.67 8113.28 10:31:30 AM 202.41 280.08 11409.46 10:31:35 AM 243.37 255.42 8815.46 10:31:40 AM 92.37 194.38 5890.96 10:31:45 AM 283.70 313.08 12742.05 10:31:50 AM 414.83 11179.96 7938.48 10:31:55 AM 198.79 31014.95 9007.47 10:32:00 AM 235.74 25065.86 10159.84 10:32:05 AM 202.01 37361.45 11010.24 10:32:10 AM 256.91 5640.48 7596.59 10:32:15 AM 246.89 20823.65 5411.42 10:32:20 AM 114.46 41366.27 6625.30 10:32:25 AM 188.76 20948.19 25422.09 10:32:30 AM 177.15 29934.67 9358.52 10:32:35 AM 237.83 9914.69 9167.40 10:32:40 AM 207.71 81296.55 8555.17 fault/s majflt/s pgfree/s pgscank/s pgscand/s pgsteal/s %vmeff 0.81 1351.62 0.00 0.00 0.00 0.00 2.21 3286.92 0.00 0.00 0.00 0.00 1.61 1859.64 32276.31 16282.73 565.06 1.16 1.42 53762.75 4495.55 0.00 93.72 2.08 1.81 1483.30 0.00 0.00 0.00 0.00 1.61 4045.47 0.00 0.00 0.00 0.00 2.82 3114.29 0.00 0.00 0.00 0.00 2.21 1905.62 0.00 0.00 0.00 0.00 1.00 1059.84 0.00 0.00 0.00 0.00 2.21 5263.38 0.00 0.00 0.00 0.00 2.00 45495.59 39413.23 0.00 784.17 1.99 2.63 2374.95 0.00 0.00 0.00 0.00 2.61 1866.47 0.00 0.00 0.00 0.00 2.01 3250.00 0.00 0.00 0.00 0.00 3.01 3638.08 0.00 0.00 0.00 0.00 1.80 1704.21 0.00 0.00 0.00 0.00 0.80 1352.41 0.00 0.00 0.00 0.00 1.81 8850.20 0.00 0.00 0.00 0.00 1.60 54522.65 42292.59 4315.83 1071.14 2.30 2.21 2483.50 0.00 0.00 0.00 0.00 2.64 2631.85 0.00 0.00 0.00 0.00

The test itself reports latencies over 100 ms and in this mode we occasionally see I/O operations taking around 200 ms reported.

We can change the kernel parameter dynamically while the test is running and see the behaviour change almost immediately:

10:35:15 AM pgpgin/s pgpgout/s fault/s majflt/s pgfree/s pgscank/s pgscand/s pgsteal/s %vmeff 10:35:20 AM 365.06 15634.14 6300.40 3.41 3841.57 0.00 15241.77 2644.18 17.35 10:35:25 AM 333.06 5519.35 9262.10 3.43 8639.31 0.00 92890.32 4528.63 4.88 10:35:30 AM 1158.15 20868.81 10292.96 10.06 12215.09 0.00 255137.22 7858.55 3.08 10:35:35 AM 781.12 41385.54 7742.77 5.02 5841.16 0.00 34506.02 3422.89 9.92 10:35:40 AM 518.10 8764.47 2524.85 3.25 2906.59 0.00 1703326.11 2016.93 0.12 10:35:52 AM 2576.57 39524.85 13449.49 11.31 10332.12 0.00 1153144.24 4256.77 0.37 10:35:57 AM 2707.22 40786.31 7962.55 8.17 9893.92 0.00 4246095.82 6729.66 0.16 10:36:02 AM 1600.75 1889.37 2551.12 4.34 629.04 0.00 3595585.63 253.52 0.01 10:36:16 AM 756.94 39362.58 2063.18 8.25 3785.71 0.00 4238635.01 1814.29 0.04 10:36:21 AM 990.94 9277.31 1584.26 6.24 1692.88 0.00 6222810.91 833.73 0.01 10:36:52 AM 69.73 0.00 116.91 0.96 271.29 0.00 2056531.75 7.20 0.00

The number of pages scanned per second escalates.

Meanwhile I/O latencies reported by the test program escalate up to 36000 ms. We actually have to kill the test program within 30 seconds of changing the kernel parameter to avoid the system becoming so unresponsive it cannot maintain sshd connections.

In our real world Informix example we are not using the page cache anything like as aggressively and when the problem occurs I/O demands reduce as we get down to a single thread in critical section. Thus we don’t see pages scanned at the rate in the test, just a clear increase.

It’s worth mentioning that new NUMA capabilities have been added to the Linux kernel in version 3.8 (and later in 3.13) so RHEL 7 users might see slightly different behaviour.

Large parallel index builds and temp space

Posted: 7 July, 2015 | Author: Ben Thompson | Filed under: Instance and operating system tuning, SQL query tuning, Storage schema management | Tags: informix, parallel index build, temp space | Leave a comment

This is a quick post about parallel index builds. Today I was building with PDQPRIORITY a unfragmented detached index on a large table fragmented by range with ten large fragments and I saw this message in the online log:

10:28:53 WARNING: Not enough temp space for parallel index build. Space required = 110566014 pages; space available = 8385216 pages. Partial index build started.

You can see I am quite a long way short of the temp space required here; I need just over thirteen times more.

In this instance I have eight temporary dbspaces available and all are listed in the DBSPACETEMP onconfig parameter and I have no local override. They are all 2 Gb and using a 16 kb page size so have 131072 pages each and, as I am in single user mode, I know they are all free. onstat -d confirms that 131019 pages of 131072 are free in each of them. In case it’s relevant I also have 1,027,203 2 kb pages free in rootdbs.

The first thing that confuses me is the 8,385,216 pages the online log message says are available, which is more than I actually have. 131019 * 8 = 1048152. I think this is a bug as it’s a factor of 8 out. It’s probably assuming a 2 kb page size somewhere and my 16 kb dbspaces are a 8x multiple of this. I am using Linux so is Informix using native page size units and just not making it clear?

The index I am creating is on a bigint field and there are 7,076,224,823 rows. If I assume 110,566,014 pages actually means 210 Gb, the engine is calculating 32 bits/row or 4 bytes/row exactly which sounds right.

Anyway despite the message in the online log I am comforted by this IBM support article which tells me:

You do not have to take action. This is a warning. The database server will create the index one fragment at a time, instead of all at once.

However, it does advise me that cancelling the statement, adding more temp space and starting again would be a good idea. This is prescient as we’ll see.

Analysing this now it is probably going to fail somewhere because I need thirteen times more space but the engine can only divide the workload by working on a single fragment at a time. There are ten and they are not all exactly the same size. In fact my largest fragment has 1,950,612,068 rows, 27% of the total and based on 4 bytes/row the largest fragment I can handle would have only 536,653,818 rows. I suspect this means to succeed I will need at least 30,478,314 2 kb pages available to complete the build. I hope this all makes sense anyway!

Foolhardily and possibly because I get distracted by something I leave it to run. More messages appear in the log as the build progresses:

11:22:33 WARNING: Not enough temp space for parallel index build. Space required = 110566014 pages; space available = 8385216 pages. Partial index build started. 12:19:28 WARNING: Not enough temp space for parallel index build. Space required = 110566014 pages; space available = 8385216 pages. Partial index build started. 13:27:03 WARNING: Not enough temp space for parallel index build. Space required = 110566014 pages; space available = 8385216 pages. Partial index build started. 13:47:56 Session Insufficient space in temporary dbspaces: Creating the temporary table in the root dbspace, Temporary table size is 17632 pages.

Nearly four hours after it began at 14:27:41 my index build fails with:

212: Cannot add index. 179: ISAM error: no free disk space for sort

Harumph.

I guess there are two annoying things about this:

The support article is only right if your largest fragment will not require more space than is available.
The failure could have been foreseen at the beginning by looking at row counts.

Anyway, I hope this helps. If I get time I will do some more testing on this to confirm some of the assumptions I have made in writing this article. Feedback is welcome as ever (via https://informixdba.wordpress.com for those of you reading this on PlanetIDS)!

Experience with Auto Update Statistics (AUS)

Posted: 30 April, 2015 | Author: Ben Thompson | Filed under: Instance and operating system tuning, Monitoring, SQL query tuning | Tags: AUS, Auto Update Statistics, AUTO_STAT_MODE, DBUPSPACE, IC91678, PSORT_DBTEMP, PSORT_NPROCS, STATCHANGE, USTLOW_SAMPLE | 1 Comment

Introduction

This is based on my previous blog post, Working with auto update statistics which I’ve expanded and significantly improved. I presented this at the Informix International User Group Conference 2015.

Let’s start at the beginning. Why do we run UPDATE STATISTICS at all? When we write an SQL query and send it to the database engine to execute, there may be several ways that the engine can run the query. For example if there are two tables the engine can start with the first table and join the second to it or begin with the second and join the first. There may be also two filter conditions, one of which may very specific and pick out only a few rows; the other may be very general. It should be apparent that some ways are more efficient than others, sometimes by several orders of magnitude.

Informix uses a cost-based optimizer to determine how to run queries. This relies on metadata to provide information about how large tables are, how many distinct values there are and other information about your data. We call these pieces of information statistics and if we also have a histogram of a column showing the abundance of specific values or ranges we call this a distribution.

The optimizer looks at possible query plans and chooses the one with the lowest costs assigned to it, according to the statistics and distributions. This may or may not be the best plan possible; if not you may see poor performance. Inaccurate cost predictions could be because your statistics are inadequate or out of date.

Maintaining the statistics and distributions is a key DBA responsibility.

What statistics and distributions should you maintain? The purpose is to ensure the optimizer selects the most efficient query plan for your queries.

These query plans should be stable over time. Normally stability is achieved through not changing things: this is why your Change Manager likes to say no (sometimes). However, with UPDATE STATISTICS stability comes from regularly refreshing your statistics and distributions: this is a change to your system you may be doing daily.

A slide from my presentation to the IIUG 2015 Conference.

What statistics do you need? The Informix Performance Guide offers a general set of recommendations. However:

They may be more than you need and therefore more to maintain, which could be a headache with a large system.
In some specific cases they may not be good enough. The result of this can be an application codebase full of query directives (instructions to the optimizer to run queries in a particular way).
The guide doesn’t say much about how frequently you should run UPDATE STATISTICS.

Statistics and distributions

Statistics	Distributions
Updated by UPDATE STATISTICS [LOW]. systables: nrows: number of rows. npused: number of pages used on disk. ustlowts: when UPDATE STATISTICS was last run. syscolumns: for indexed columns only: colmin: the second lowest value. colmax: the second highest value. sysindices: levels: number of B-Tree levels. leaves: number of leaves. nunique: number of unique values in the first column. clust – incremented when values in the index are different to the last: max value is the number of rows; a low number indicates lots of duplicates. nrows: number of rows. ustlowts: when UPDATE STATISTICS was last run. ustbuildduration: time to build index statistics.	Updated by UPDATE STATISTICS MEDIUM or HIGH. Consist of histograms for values or value ranges in equally sized buckets in sysdistrib. Fragment-level statistics are stored in sysfragdist.

Statistics

Distributions

Updated by UPDATE STATISTICS [LOW].
systables:
- nrows: number of rows.
- npused: number of pages used on disk.
- ustlowts: when UPDATE STATISTICS was last run.
syscolumns: for indexed columns only:
- colmin: the second lowest value.
- colmax: the second highest value.
sysindices:
- levels: number of B-Tree levels.
- leaves: number of leaves.
- nunique: number of unique values in the first column.
- clust – incremented when values in the index are different to the last: max value is the number of rows; a low number indicates lots of duplicates.
- nrows: number of rows.
- ustlowts: when UPDATE STATISTICS was last run.
- ustbuildduration: time to build index statistics.

Updated by UPDATE STATISTICS MEDIUM or HIGH.
Consist of histograms for values or value ranges in equally sized buckets in sysdistrib.
Fragment-level statistics are stored in sysfragdist.

Auto Update Statistics (AUS) basics

Auto Update Statistics (AUS) consists of two database scheduler jobs. These are stored in the sysadmin database in table ph_task with configuration settings in ph_threshold.

Auto Update Statistics Evaluation: This calls a set of stored procedures which populate the sysadmin:aus_command table with a prioritized list of UPDATE STATISTICS commands to run. The code for these procedures is in $INFORMIXDIR/etc/sysadmin/sch_aus.sql
Auto Update Statistics Refresh: This is a UDR that does the work of calling the UPDATE STATISTICS commands. In older versions this was done with SPL code. Auto Update Statistics Refresh cannot be called even manually without the database scheduler running.

If your instance has a stop file ($INFORMXDIR/etc/sysadmin/stop) to prevent the scheduler initialising with the instance you must remove it. Before you do so it’s a good idea to review which jobs are enabled (tk_enable='t') in the ph_task table. (Using a stop file is a bad idea with 12.10 because running without the scheduler stops some system processes from functioning, so if you’re doing this, you ought to change it even if you don’t want to use AUS.)

One advantage of AUS is that it works on all of your databases, including sysmaster.

Instance parameters affecting UPDATE STATISTICS

Three onconfig parameters affect statistics maintenance, independently of the tool you use to update statistics:

Parameter	Description
AUTO_STAT_MODE	Controls whether to only update statistics or distributions the engine considers stale when running UPDATE STATISTICS.
STATCHANGE	Controls the percentage change beyond which statistics are considered stale if AUTO_STAT_MODE is enabled. This can be overridden at table level.
USTLOW_SAMPLE	Whether to use sampling for UPDATE STATISTICS LOW.

AUS specific parameters

AUS’s configuration settings stored in ph_threshold work independently of these system level settings.

These can be updated via SQL without using OpenAdmin Tool:

UPDATE sysadmin:ph_threshold SET value='0' WHERE name='AUS_CHANGE';

Parameter	Description
AUS_AGE	The number of days after gathering statistics that they are considered stale.
AUS_CHANGE	Prioritises tables based on the percentage of rows that have changed. This is not to be confused with STATCHANGE.
AUS_PDQ	The PDQPRIORITY to use while updating statistics. If you are using workgroup edition set this to 0 to avoid annoying messages in the online log about not being allowed to use PDQ.
AUS_SMALL_TABLES	The small table row threshold. These tables are considered volatile and updated on every run.
AUS_RULES	0 When set to zero, AUS updates any existing distributions only and won’t ensure that a minimum set of distributions exists according to rules. This allows you more fine-grained control of what you maintain. Be aware that creating an index automatically builds HIGH mode distributions on leading columns with data so if you don’t want these you would need to drop them. 1 Where no existing distributions exist, AUS will create: HIGH mode distributions on all leading index columns. MEDIUM mode distributions on other indexed columns. For most systems ‘1’ is the value you should use. If in doubt, use this.

You’ll notice in the description for AUS_RULES=1 that AUS does not automatically generate any distributions on non-indexed columns. However it will maintain any existing distributions regardless of the value of AUTO_AUTO_RULES, even if AUS wouldn’t have created them if they didn’t exist.

Non-indexed columns and distributions

If you are migrating from dostats you will have MEDIUM mode distributions on all non-indexed columns. It’s possible to drop these on individual columns using:

UPDATE STATISTICS LOW FOR TABLE <tab name> (<colname>) DROP DISTRIBUTIONS ONLY;

I’d test before doing anything like this so you may just choose to leave them in and let AUS maintain them. If UPDATE STATISTICS MEDIUM doesn’t take very long on your system this is probably the best choice.

Are these distributions important? My answer is that it depends but, more often than not, no. The Informix Performance Guide recommends you have them but this is just a general recommendation. At this point it’s important not to lose sight of the fact that the goal is not to have the most comprehensive, high-resolution and up to date statistics possible; it is to ensure you can guarantee that your queries always run efficiently. Given that we can’t be updating statistics and distributions on all columns all of the time some compromises are inevitable.

Often when running a query the main risk is the optimizer choosing to use sequential scans instead of an index. This risk is greatly reduced, if not eliminated, if the onconfig parameter OPTCOMPIND is set 0. The downside of this is that the optimizer won’t select a sequential scan when it is the best query plan available unless there are no other options.

In general distributions on a column are more useful if there is some skew in the data. However be aware that for non-indexed columns syscolumns.colmin and syscolumns.colmax are never populated by UPDATE STATISTICS LOW so the optimizer is truly blind about the data ranges without a distribution.

I’m going to run through an example now using this table:

people table			indices
person_id	serial	not null	ix_people1	(person_id)
name	varchar(254)	not null	ix_people2	(age)
gender	char(1)	not null
age	int	not null

The table will be populated with random ersatz data as follows:

1 million rows.
Ages evenly distributed between 5 and 84.
75% female, 25% male.

I’ll be running this query:

SELECT name FROM people WHERE age BETWEEN 18 AND ? AND gender=?

And tweaking these parameters:

Upper age limit in the query.
Gender in the query.
Whether I have a medium mode distribution on gender.
The value of OPTCOMPIND in the onconfig.

My results were as follows:

Upper age	Gender	Actual rows returned	Medium mode distribution on gender?	OPTCOMPIND	Estimated rows	Estimated cost	Plan
25	f	74817	No	0	9988	21104	Index path
			No	2	9988	21104	Index path
			Yes	0	75539	21104	Index path
			Yes	2	75539	21104	Index path
25	m	25061	No	0	9988	21104	Index path
			No	2	9988	21104	Index path
			Yes	0	24539	21104	Index path
			Yes	2	24539	21104	Index path
30	f	121748	No	0	16232	38477	Index path
			No	2	16232	33923	Sequential scan
			Yes	0	122439	38477	Index path
			Yes	2	122439	33923	Sequential scan
30	m	40572	No	0	16232	38477	Index path
			No	2	16232	33923	Sequential scan
			Yes	0	39881	38477	Index path
			Yes	2	39881	33923	Sequential scan

What conclusions can we draw from this example?

OPTCOMPIND was a determining factor, not the presence of a medium mode distribution on gender.
Having the distribution gave a much better estimate of the number of rows returned.
The optimizer never used a different plan for the male or female queries.

Of course this is one example and you may have some different ones.

Columns with incrementing values

Let’s illustrate a different point with another example.

Maintaining distributions on any sort of log or journal where there is a timestamp field can be a problem. The highest value in your distribution is wrong almost immediately after calculating it because new rows are being added all the time with later timestamps. This means that if you do a query over recently added data your distributions may tell the optimizer that there’s no data.

ledger table			indices
line_id	serial	not null	ix_ledger1	(line_id)
account_id	int	not null	ix_ledger2	(account_id, creation_time)
creation_time	datetime year to second	not null	ix_ledger3	(creation_time)
			ix_ledger4	(account_id)

for_processing table			indices
line_id	int	not null	ix_fp1	(line_id)
creation_time	datetime year to second	not null	ix_fp2	(creation_time)

Both of these tables have over 1 million rows and new rows being added continuously.

I am going to run this query:

SELECT  FIRST 5  l.line_id, l.creation_time  FROM  ledger l,  for_processing f  WHERE  f.line_id = l.line_id AND  l.account_id = 50 AND  f.creation_time BETWEEN  '2015-02-02 17:00:00' and '2015-02-02 21:00:00';

There are two conceivable ways to run this query:

Use the index on creation_time on for_processing, join to the ledger table on line_id and then filter on the account_id column.
Use the index on account_id on the ledger table, join by line_id and then filter on the creation_time column.

The risk with the first one is that a lot of rows are read, only to eliminate the vast majority of them when the account_id criterion is applied.

The optimizer may prefer to drive off creation_date, particularly if the distribution indicates there are no data past the time the distribution was gathered. This is because it believes (wrongly) that selecting from a date range in the for_processing table where it believes there is no data at all, avoiding the need to do any work, is more efficient than selecting out an account from the ledger and then joining the for_processing table.

This can be very slow for large date ranges. This is particularly true when there are a large number of accounts.

Can AUS help you here? Not really, this example is more to point out a danger. The risk of running into a problem like this can be massively increased if you use the default STATCHANGE value of 10. This is because here it is the age of the distributions that matters, not how much the data has changed.

My recommendation is:

In your onconfig either:

set AUTO_STAT_MODE to 0.

Or set STATCHANGE to 0 if AUTO_STAT_MODE is 1.

If there are tables for which this is not appropriate, do so at the table level:

ALTER TABLE <tabname> STATCHANGE <percent>;

In my view restricting updates only when there have been absolutely no changes is the only safe way.

There is an undocumented feature that can help here:

IC91678: Informix optimizer vulnerable to poor query response on incrementing columns.

However the fix is not switched on unless you set onconfig parameter: SQL_DEF_CTRL 0x2. This can be switched on dynamically with onmode -wm. With this switched on, date/time distributions are effectively extended into the future by up to 30 days. While the answer to the question how old can my statistics and distributions be is nearly always it depends, with this switched on there is a hard limit.

The date or datetime column concerned must have a default value of TODAY or CURRENT. The code also compares the last timestamp in the distribution with the time the distribution was gathered. The two must be close together to activate the feature.

This fix also works on serial, serial8 and bigserial fields.

This feature is a little tricky to test because you must:

Populate the tables with a large volume of data.
Update the distributions.
Add some more data with later timestamps.
Wait a while!

Here are my results:

	Date range
	Within the distribution bounds	Beyond the distribution upper bound	Beyond the distribution upper bound with IC91678
Query drives off	account_id	creation_time	account_id
Costs: lead with account_id	264	398	398
Costs: lead with creation_time	79004	4	398
Estimated rows	20	1	1
Actual rows	10	1	1

Wildcard queries and insufficient distribution resolution

Another problem you may have with the standard distributions created by UPDATE STATISTICS HIGH is insufficient resolution. By default the data are divided into 200 equal-sized buckets (resolution 0.5) and this may not suffice for some wildcard queries. The optimizer may be grossly wrong in its estimates for the number of rows returned and this can be improved by increasing the number of buckets.

customer table			indices
customer_id	serial	not null	ix_customer1	(customer_id)
user_name	varchar(50)	not null	ix_customer2	(user_name)

Again this table will be populated with random ersatz data as follows:

9 million rows.
This gives a distribution bucket size of 45000 with 0.5 resolution.

And my query:

SELECT FIRST 1000 customer_id, user_name FROM customer WHERE user_name LIKE 'BAN%' ORDER BY customer_id;

Look at the query carefully. It’s the sort of query you might get from a web customer search form. The ORDER BY on customer_id is important because it gives the engine the option of avoiding sorting any data if this index is used to select the data. If the optimizer thinks the user_name criterion is not very selective, i.e. there are a lot more than 1000 customers whose user name begins with the letters BAN and it will find them quickly without reading many rows, it may prefer this query plan.

There are two credible plans for this query:

Use the index on user_name, read all the rows it points to and then sort on customer_id, return the first 1000.
Use the unique index on customer_id, read the rows it points to, filter, stop at 1000 rows. This plan does not require a sort.

There is a third option of scanning the entire table but it’s unlikely the optimizer will choose this.

Let’s look at the results of running this query with a standard HIGH mode distribution on customer_name, a second distribution with ten times the resolution and no distribution at all.

	Distribution
Variable	HIGH 0.5 resolution	HIGH 0.05 resolution	No distributions
Selected index (no directive)	customer_id	customer_name	customer_id
Costs: lead with customer_id	3480	313249	174
Costs: lead with user_name	567525	1214	12435040
Estimated rows	90000	1000	1800000
Actual rows	1028	1028	1028

This example is real but also carefully selected because different three letter combinations may give different and more sane results, even with the standard distribution resolution.

If instead I run this query:

SELECT FIRST 1000 customer_id, user_name FROM customer WHERE user_name LIKE 'TDM%' ORDER BY customer_id;

I get:

	Distribution
Variable	HIGH 0.5 resolution	HIGH 0.05 resolution	No distributions
Selected index (no directive)	customer_name	customer_name	customer_id
Costs: lead with customer_id	135723	313249	174
Costs: lead with user_name	2909	1214	12435040
Estimated rows	2308	1000	1800000
Actual rows	801	801	801

AUS makes maintaining custom resolutions (or confidence levels with UPDATE STATISTICS MEDIUM) very easy. Simply run UPDATE STATISTICS manually to gather the desired distribution. AUS will maintain this resolution for you.

How frequently should I run UPDATE STATISTICS?

A common question a DBA may ask is how often is it necessary to UPDATE STATISTICS? A straightforward if unhelpful answer is often enough to ensure efficient query plans. More specifically:

On tables without columns with incrementing values, by which I mean that new values lie inside the range of the existing data, or where these columns are not used in the selection criteria for queries, it may be safe to use STATCHANGE (with AUTO_STAT_MODE) to regulate the frequency based on percentage change at the table level.
With incrementing values the working set can quickly get beyond the min/max values in the statistics or distributions. Here I’d recommend being much more aggressive and update based on age, regardless of the volume of change. This especially applies to your distributions.

If your system is small enough that you can run a full UPDATE STATISTICS every day there is no harm in doing this. It is probably overkill but it is one way of playing safe. To do this set AUS_CHANGE to 0 and make sure both the scheduler and the evaluator run daily. For larger systems you do need to be more selective about how often you run UPDATE STATISTICS.

Monitoring AUS

Does AUS just work? Well yes, it was aimed at the embedded market and it is pretty robust. On a larger system there is more to go wrong and so I’d recommend you check that:

the parameters are set correctly (in case they change if you rebuild the sysadmin database, for example).
there are no errors running UPDATE STATISTICS.
all statistics and distributions are sufficiently up to date.
the window you give the refresh job to run in is long enough.

Check	Notes
Correct parameters	Job scheduler running and dbScheduler thread exists. Evaluator and refresh tasks are enabled and have the expected schedules. AUS_AGE, AUS_CHANGE, AUS_AUTO_RULES, AUS_PDQ and AUS_SMALL_TABLES set correctly.
No errors	No errors reported in the aus_command table (`aus_cmd_state='E'`).
Statistics and distributions sufficiently up to date	Query system tables to get the age of statistics and distributions. Fairly complex check, particularly if using STATCHANGE>0. Then need to consider ninserts, nupdates and ndeletes columns. Statistics: Find tables where systables.ustlowts older than your threshold and then check whether it is in the result set of this SQL: SELECT {+ORDERED} st.tabname FROM sysdistrib sd, systables st, sysmaster:sysptnhdr sp WHERE sd.tabid = st.tabid AND st.partnum = sp.partnum AND sd.tabid > 99 AND st.ustlowts > <days old> AND (sp.ninserts - sd.ninserts) + (sp.nupdates - sd.nupdates) + (sp.ndeletes - sd.ndeletes) > 0 AND ( st.statchange IS NOT NULL AND ( st.statchange = 0 OR ((sp.ninserts - sd.ninserts) + (sp.nupdates - sd.nupdates) + (sp.ndeletes - sd.ndeletes))/st.nrows100 > st.statchange ) OR ( st.statchange IS NULL AND ( <onconfig STATCHANGE parameter> = 0 OR ((sp.ninserts - sd.ninserts) + (sp.nupdates - sd.nupdates) + (sp.ndeletes - sd.ndeletes))/st.nrows100 > <onconfig STATCHANGE parameter>) ) ) GROUP BY st.tabname UNION SELECT {+ORDERED} st.tabname FROM sysdistrib sd, systables st, sysfragments sf, sysmaster:sysptnhdr sp WHERE sd.tabid = st.tabid AND st.tabid = sf.tabid AND sf.partn = sp.partnum AND sd.tabid > 99 AND st.statchange is not null AND sf.fragtype='T' AND st.ustlowts > <days old> GROUP BY st.tabname HAVING (SUM(sp.ninserts) - AVG(sd.ninserts)) + (SUM(sp.nupdates) - AVG(sd.nupdates)) + (SUM(sp.ndeletes) - AVG(sd.ndeletes)) > 0 AND ( MIN(st.statchange) = 0 OR (SUM(sp.ninserts) - AVG(sd.ninserts)) + (SUM(sp.nupdates) - AVG(sd.nupdates)) + (SUM(sp.ndeletes) - AVG(sd.ndeletes))/AVG(st.nrows)100 > MIN(st.statchange) ) Distributions: SELECT DISTINCT tabname, colname, TODAY - sd.constr_time st.nrows FROM sysdistrib sd JOIN systables st ON (sd.tabid = st.tabid) JOIN syscolumns sc ON (sd.tabid = sc.tabid AND sd.colno = sc.colno) JOIN sysmaster:sysptnhdr sph ON (sph.partnum = st.partnum) WHERE sd.tabid > 99 AND (sph.ninserts - sd.ninserts) + (sph.nupdates - sd.nupdates) + (sph.ndeletes - sd.ndeletes) > 0 AND ( st.statchange IS NOT NULL AND ( st.statchange = 0 OR ((sph.ninserts - sd.ninserts) + (sph.nupdates - sd.nupdates) + (sph.ndeletes - sd.ndeletes))/st.nrows100 > st.statchange ) OR ( st.statchange IS NULL AND ( <onconfig STATCHANGE parameter> = 0 OR ((sph.ninserts - sd.ninserts) + (sph.nupdates - sd.nupdates) + (sph.ndeletes - sd.ndeletes))/st.nrows*100 > <onconfig STATCHANGE parameter>) ) ) AND (TODAY - DATE(sd.constr_time)) >= <days old> ORDER BY 4, 3 DESC
Long enough window in which to run commands	No pending commands at the end of the refresh task window (`aus_command.aus_cmd_state='P'`);

UPDATE STATISTICS FOR PROCEDURE

How does AUS deal with stored procedure plans as stored in sysprocplan? Well it doesn’t directly and does not call UPDATE STATISTICS FOR PROCEDURE [procname].

My take on this is that routines referencing updated tables will be recompiled anyway the first time they are called immediately after running UPDATE STATISTICS. On a busy OLTP system this will probably happen before you have chance to update procedure plans manually. If you do have reason to do this, you will need to do it manually and if you do don’t set PDQPRIORITY.

If your system does have dead time there may be a small advantage to running this but I don’t think it really matters that much.

Method for running UPDATE STATISTICS [LOW]

Let’s now look at how AUS actually calls UPDATE STATISTICS. As discussed earlier the evaluation task creates a list of commands to run and these are run by the refresh task exactly as you see them when you query the aus_command table.

Let’s start by considering UPDATE STATISTCS [LOW].

AUS simply calls UPDATE STATISTICS [LOW] as one command without specifying any column names and I have seen it suggested that this is a bad thing (I don’t agree). The popular alternative, dostats, runs UPDATE STATISTICS LOW separately for each index key. Performance-wise there is not as much difference as you might expect, I suspect this is because data tend to be in the buffer cache on repeated calls. But is there any difference to the end result?

Using Informix 11.70.FC7W1 I performed the following test:

Created two identical empty tables with twelve indices and loaded both with the same data.
Ran AUS on one, dostats on the other.
Unloaded and compared systables, syscolumns and sysindices for both tables.

Apart from tabid, tabname, index names and build times, the unloads were identical. My conclusion is that is no difference and AUS is slightly faster because it is done with one command in a single pass.

Performance of UPDATE STATISTICS [LOW]

In terms of performance there is actually very little you can do to influence the speed of UPDATE STATISTICS LOW. The parameters DBUPSPACE, PSORT_NPROCS and even PDQPRIORITY have no effect.

Setting USTLOW_SAMPLE in the onconfig file is the only performance optimisation available apart from general server tuning.

It is supposed to be possible to get parallel scan threads for all fragmented indices when PDQPRIORITY is set to at least 10 but I can’t reproduce this. This only works with USTLOW_SAMPLE is switched off.

Performance of UPDATE STATISTICS MEDIUM or HIGH

For UPDATE STATISTICS HIGH and MEDIUM there are a few parameters that can influence performance and with the exception of PDQPRIORITY you can’t use any of them with AUS. In summary they are:

DBUPSPACE: three parameters in one controlling four things:
- how much memory and disk are available for sorting when PDQPRIORITY is not set.
- whether to use indices for sorting.
- whether to print the explain plan.
DBSPACETEMP: overrides default onconfig value.
PSORT_DBTEMP: allows you to specify a filesystem instead of DBSPACETEMP.
PSORT_NPROCS: specifies number of threads for sorting.

Having done tests with this parameter:

The disk and memory parameters don’t have any effect if you set PDQPRIORITY. If you do up to DS_TOTAL_MEMORY multiplied by the effective PDQPRIORITY (as a percentage) can be allocated.
Even if set, they only have the effect of restricting the number of columns that can be processed in a single pass so a lot of the time they make no difference. For a better understanding of how this works see John Miller III’s article on Understanding and Tuning Update Statistics.
Setting the explain output via the directive setting works a bit weirdly: you need to run set explain on separately to enable the explain plan even if set here but disabling it with this parameter does work.
The default is to use indices for sorting. Switching this off (directive 1 or 2) was up to 5x slower on my test system.

Here are some performance tests showing the effects of different DBUPSPACE settings on UPDATE STATISTICS HIGH. My performance tests are all carried out on a powerful 32 core Linux x86_64 server with 256 Gb RAM and fast SAN storage (not my laptop). For my tests I did a dummy run first to try and even out any effects of caching in the SAN. I then did a run immediately after initialising my instance (cold) and a repeat run (warm) for each result.

10 million row non-partitioned table with 12 single-column indices:

DBUPSPACE	PDQPRIORITY	Memory allocated (Mb)	No. of scans	Indices used for sorting?	Light scans used?	Time elapsed cold (s)	Time elapsed warm (s)
1024:15:0 (default)	0	15	12	Yes	No	111	97
0:50:0	0	50	12	Yes	No	108	97
0:50:1	0	50	12	No	No	259	254
0:50:0	100	2462.7	1	Yes	Yes	172	180
0:50:1	100	2462.7	1	No	Yes	176	172

74 million row table with 8-way round-robin partitioning and 3 single-column indices:

DBUPSPACE	PDQPRIORITY	Memory allocated (Mb)	No. of scans	Indices used for sorting?	Light scans used?	Time elapsed cold (s)	Time elapsed warm (s)
1024:15:0 (default)	0	15	3	Yes	No	227	169
0:50:0	0	50	3	Yes	No	224	168
0:50:1	0	50	3	No	No	501	494
0:50:0	100	3064.7	1	Yes	Yes	432	426
0:50:1	100	3064.7	1	No	Yes	425	428

What conclusions can we draw from these results?

Using PDQPRIORITY actually makes it slower.
There is no significant difference between using 15 Mb of memory for sorts and 50 Mb of memory for sorts. I suspect this is because the number of scans is the same.
Using indices for sorting (the default) is faster than not using indices for sorting when not using PDQ.
The use of light scans (avoiding the buffer cache) reduces the variation between the cold and warm results.

Despite having a partitioned table and PDQPRIORITY set, the interesting thing here is that during the execution I observed no parallelism. You can see this for yourself by identifying the session running UPDATE STATISTICS and looking at the threads the session is using. I get something like this for my partitioned table:

IBM Informix Dynamic Server Version 11.70.FC7W1 -- On-Line -- Up 00:01:11 -- 228250944 Kbytes

session           effective                            #RSAM    total      used       dynamic 
id       user     user      tty      pid      hostname threads  memory     memory     explain 
14       informix -         -        28798    guebobdb 1        11563008   10603832   off 

Program :
/opt/informix-11.70.FC7W1/bin/dbaccess

tid      name     rstcb            flags    curstk   status
201      sqlexec  26e1fbbfc8       ---PR--  24752    running-

Memory pools    count 4
name         class addr              totalsize  freesize   #allocfrag #freefrag 
14           V     26e4c59040       139264     10008      176        12        
14_SORT_0    V     26e573d040       4108288    294896     34865      4         
14_SORT_1    V     26e5748040       3657728    326176     32934      4         
14_SORT_2    V     26e5730040       3657728    325936     32937      4         

name           free       used           name           free       used      
overhead       0          13152          scb            0          144       
opentable      0          10640          filetable      0          3016      
ru             0          600            misc           0          160       
log            0          16536          temprec        0          21664     
blob           0          832            keys           0          664       
ralloc         0          19320          gentcb         0          1592      
ostcb          0          2944           sort           0          1577072   
sqscb          0          20224          sql            0          72        
srtmembuf      0          817320         hashfiletab    0          552       
osenv          0          3000           buft_buffer    0          8736      
sqtcb          0          9680           fragman        0          3640      
shmblklist     0          8074432        

sqscb info
scb              sqscb            optofc   pdqpriority optcompind  directives
e28f91c0         e4c5a028         0        100         0           1         

Sess       SQL            Current            Iso Lock       SQL  ISAM F.E. 
Id         Stmt type      Database           Lvl Mode       ERR  ERR  Vers  Explain    
14         UPDATE STATIST testdb             DR  Wait 15    0    0    9.24  On         

Current SQL statement (6) :
  update statistics high for table large_8-way_table
    (id_col1,id_col2,date_col) distributions only force

Last parsed SQL statement :
  update statistics high for table large_8-way_table
    (id_col1,id_col2,date_col) distributions only force

In the above example we can see that the engine has allocated three memory pools, one each for the three columns we are updating in a single pass with PDQPRIORITY set.

If you do see parallel sort (PSORT) threads like this:

Program :
/opt/informix-11.70.FC7W1/bin/dbaccess

tid      name     rstcb            flags    curstk   status
370      sqlexec  26e1fbaf28       ---PR--  20688    running-
371      psortpro 26e1fde468       Y------  912      cond wait  incore    -
372      psortpro 26e1fcda68       Y------  912      cond wait  incore    -
373      psortpro 26e1fd88f8       Y------  912      cond wait  incore    -
374      psortpro 26e1fd1498       Y------  912      cond wait  incore    -
375      psortpro 26e1fc6608       Y------  912      cond wait  incore    -
376      psortpro 26e1fdc328       Y------  912      cond wait  incore    -
377      psortpro 26e1fd4ec8       Y------  912      cond wait  incore    -
378      psortpro 26e1fca038       Y------  912      cond wait  incore    -
379      psortpro 26e1fbe958       Y------  912      cond wait  incore    -
380      psortpro 26e1fdbad8       Y------  912      cond wait  incore    -
381      psortpro 26e1fcb0d8       Y------  912      cond wait  incore    -
382      psortpro 26e1fda1e8       Y------  912      cond wait  incore    -
383      psortpro 26e1fc97e8       Y------  912      cond wait  incore    -
384      psortpro 26e1fd5718       Y------  912      cond wait  incore    -
385      psortpro 26e1fcd218       Y------  912      cond wait  incore    -
386      psortpro 26e1fc2388       Y------  912      cond wait  incore    -
387      psortpro 26e1fd80a8       Y------  912      cond wait  incore    -
388      psortpro 26e1fd0c48       Y------  912      cond wait  incore    -
389      psortpro 26e1fc5db8       -------  1632     running-
390      psortpro 26e1fdf508       Y------  912      cond wait  incore    -
391      psortpro 26e1fd7858       Y------  912      cond wait  incore    -
392      psortpro 26e1fc6e58       Y------  912      cond wait  incore    -
393      psortpro 26e1fd35d8       Y------  912      cond wait  incore    -
394      psortpro 26e1fdcb78       -------  8        running-
395      psortpro 26e1fca888       -------  1632     running-
396      psortpro 26e1fbf9f8       -------  1632     running-
397      psortpro 26e1fcc9c8       -------  8        running-
398      psortpro 26e1fc1b38       -------  8        running-
399      psortpro 26e1fd1ce8       -------  8        running-

Memory pools    count 4
name         class addr              totalsize  freesize   #allocfrag #freefrag 
41           V     270a6cd040       954368     61560      438        53        
41_SORT_0    V     2706fc2040       3915776    2184       33415      10        
41_SORT_1    V     2706fbb040       4222976    14488      31589      27        
41_SORT_2    V     2706fbc040       4222976    14248      31592      27

It is either because PSORT_NPROCS is set in your environment or set in the database engine’s environment when it was started.

Let’s now looking at the effect of PSORT_NPROCS. The only way to use this with AUS is to set PSORT_NPROCS when you start the database engine, which will of course affect all sessions.

Setting PSORT_NPROCS is the only way to get any parallel processing with UPDATE STATISTICS HIGH or MEDIUM. It has no effect on UPDATE STATISTICS LOW. Setting PDQPRIORITY only provides more memory and allows HIGH and MEDIUM mode distributions on multiple columns to be built in a single pass, if enough memory is available. There will be one SORT memory pool per column being processed regardless of PDQPRIORITY. Sorting with PSORT_NPROCS set can be faster as we’ll see now.

Carrying on with the same examplse as above, I get these results:

10 million row non-partitioned table with 12 single-column indices:

DBUPSPACE	PDQPRIORITY	PSORT_ NPROCS	Memory allocated (Mb)	No. of scans	Light scans used?	Time elapsed cold (s)	Time elapsed warm (s)
0:50:0	0	Not set	50	12	No	108	97
0:50:0	0	24	50	12	No	123	97
0:50:0	100	Not set	2462.7	1	Yes	172	180
0:50:0	100	24	2538.4	1	Yes	80	83

74 million row table with 8-way round-robin partitioning and 3 single-column indices:

DBUPSPACE	PDQPRIORITY	PSORT_ NPROCS	Memory allocated (Mb)	No. of scans	Light scans used?	Time elapsed cold (s)	Time elapsed warm (s)
0:50:0	0	Not set	50	3	No	224	168
0:50:0	0	24	50	3	No	223	170
0:50:0	100	Not set	3064.7	1	Yes	432	426
0:50:0	100	24	3083.7	1	Yes	156	145

What conclusions can we draw this time?

The fastest UPDATE STATISTICS HIGH (or MEDIUM) performance is with PDQ and PSORT_NPROCS set. (Seasoned DBAs might have expected this result).
But it’s not much faster than running without either of these parameters set, probably with fewer server resources.
It’s worth bearing in mind that PDQ enables light scans which may avoid buffer cache churn.

Adding additional AUS Refresh tasks for greater throughput

There is another way to achieve parallelism with AUS and that is to add additional scheduler tasks so that more than one table can be worked on at once, for example:

INSERT INTO sysadmin:ph_task ( tk_name, tk_description, tk_type, tk_dbs, tk_execute, tk_delete, tk_start_time, tk_stop_time, tk_frequency, tk_monday, tk_tuesday, tk_wednesday, tk_thursday, tk_friday, tk_saturday, tk_sunday, tk_group, tk_enable ) VALUES ( 'Auto Update Statistics Refresh #2', 'Refreshes the statistics and distributions which were recommended by the evaluator.', 'TASK', 'sysadmin', 'aus_refresh_stats', '0 01:00:00', '03:11:00', '12:45:00', '1 00:00:00', 't', 't', 't', 't', 't', 'f', 'f', 'PERFORMANCE', 't' );

It is vital that the task name begins with Auto Update Statistics Refresh as shown here otherwise some of the internal code that stops the evaluator from running at the same time as the refresh tasks will not work.

Think of it as a batch process where the goal is to update all your statistics and distributions, not doing an individual table as fast as possible.

I recommend this method if you need extra grunt! Then run without PDQ (AUS_PDQ = 0).

Pros and cons of using PSORT_DBTEMP

Let’s move to consider another environment variable you can’t easily use: PSORT_DBTEMP. Is using a filesystem for sorts faster than temporary dbspaces? For this I am not going to do any performance tests largely because a comparison between the local SATA disks used for filesystems and the battery-backed, part solid state SAN used for the database storage on my server is not a fair fight.

If you want to use PSORT_DBTEMP with AUS, again you will need to set it in your environment when initialising the server and use it across your system.

The only definitive benefit of pointing to a filesystem using PSORT_DBTEMP instead of using DBSPACETEMP is that the Linux filesystem cache has an effect. This means that your temporary files may never be committed to disk, giving you a performance advantage. Another interesting alternative is to use a RAM disk.

Otherwise when considering whether to use a filesystem over temporary dbspaces, I would consider your hardware.

Recent UPDATE STATISTICS defects

Below are some defects relating to UPDATE STATISTICS that I have personally logged with IBM. All are general UPDATE STATISTICS issues and whether you use AUS or not does not affect your chances of hitting these. I have written some comments underneath.

IT06767 UPDATE STATISTICS HIGH ON TABLE BIGGER THAN 1.1 BIO ROWS CAN CAUSE EXCEPTION IN SRTQUICK: This defect can only be hit if you use PDQ and a large amount of memory is allocated.
IT06726 Assert Warning Invalid index statistics found when using statistics sampling with index having many deleted items: Requires USTLOW_SAMPLE to be switched on. It is fairly harmless but does write some garbage to sysfragments. Re-running UPDATE STATISTICS LOW usually fixes it.
IT02679 UPDATE STATISTICS HIGH ON A FRAGMENTED TABLE LEADS TO ERROR MESSAGE 9810: This can only be seen if using fragment level statistics.
IT05463 Fragment based update stats high consumes a huge amount of sblobspace: This can only be seen if using fragment level statistics. It is not worth trying to make your sbspace extremely large to work around this problem.
IT05639 -768: Internal error in routine mbmerge2bin when running update statistics on a fragmented table (running fragment-level statistics)
: Another fragment level statistics one. With the fixes for IT02679, IT05463 and IT05639 this feature is solid. I wouldn’t enable this for a table unless you have all of these in your version.

Most of these defects are fixed in 12.10.FC5.

UPDATE STATISTICS performance summary

Overall then what seems to be the best balance between speed and use of resources on your system is:

In terms of AUS, leave AUS_PDQ at 0.
If you need more throughput add more Refresh tasks to the scheduler.
That’s it.
Set DBUPSPACE=0:50:0 in the server environment.

I hope this has been useful. I’ll end with a quick summary of some of the more interesting performance points from this presentation which apply regardless of whether you use AUS or not:

Setting PDQPRIORITY only provides more memory for UPDATE STATISTICS HIGH or MEDIUM and does not provide parallel processing. It may even make it run slower.
If you want multi-threaded processing, this only works by setting PSORT_NPROCS and then only with UPDATE STATISTICS HIGH or MEDIUM. Because it does not require PDQ you can use this with workgroup edition.
In my tests only when used together do PDQPRIORITY and PSORT_NPROCS improve performance. PDQ does switch on light scans though which avoid churning your buffer cache..
Not using indices for sorting for UPDATE STATISTICS HIGH or MEDIUM can be significantly slower.
The performance of UPDATE STATISTICS [LOW] can be improved by setting USTLOW_SAMPLE.

Temporary tables

Posted: 29 March, 2015 | Author: Ben Thompson | Filed under: Instance and operating system tuning, Monitoring, SQL query tuning | Tags: CREATE TEMP TABLE, dbspace, DBSPACETEMP, SELECT INTO TEMP, TEMPTAB_NOLOG | 3 Comments

Again, this is another blog post about nothing new at all but an attempt to put down my understanding of temporary tables down in a way that will help me when I refer back to it and hopefully be of more wider use.

When looking at things like this a test system is always essential for checking things and finding some surprising results. If you don’t have one you can soon set one up with VMWare or Virtual Box and a copy of Informix Developer Edition. For this post I am going to set up a simple test instance with four specific dbspaces:

A logged rootdbs with plenty of free space, called rootdbs.
A logged dbspace for my data and indices, called datadbs.
A temporary dbspace without logging, called tmp2k_unlogged.
A temporary dbspace with logging, called tmp2k_logged.

What’s the difference between a temporary dbspace with logging and a normal dbspace? Nothing except that one appears in the DBSPACETEMP onconfig setting.

Consider the SQL statements which explicitly create temporary tables in a session:

create temp table unlogged (col1 int) with no log; insert into unlogged values (1);

and:

create temp table logged (col1 int); insert into logged values (1);

For these tests we must have TEMPTAB_NOLOG set to 0 in our onconfig otherwise the second statement will silently have a with no log criterion added to it.

Let’s run these, use oncheck -pe to see which dbspace they get placed in and then use onlog -n <starting log unique identifier> to see if changes to these tables get logged or not:

DBSPACETEMP setting	unlogged table		logged table
DBSPACETEMP setting	dbspace used	logged operations	dbspace used	logged operations
tmp2k_unlogged:tmp2k_logged	tmp2k_unlogged	no	tmp2k_logged	yes
tmp2k_unlogged	tmp2k_unlogged	no	datadbs	yes
tmp2k_logged	tmp2k_logged	no	tmp2k_logged	yes
NONE	datadbs	no	datadbs	yes

So already a few interesting results drop out:

We can specify both logged and unlogged dbspaces using the DBSPACETEMP parameter.
The engine will prefer logged dbspaces for logged tables and unlogged dbspaces for unlogged tables.
If an unlogged dbspace is not available the engine can use a logged dbspace and create unlogged tables in it.
If a logged dbspace is not available the engine will use an alternative dbspace because an unlogged dbspace does not logging at all. In this case it has chosen datadbs, because this is the dbspace in which I created my database.

At this point it’s worth referring to an IBM technote on this subject. This suggests some more tests but already my results are not in agreement with the first example given:

If we have created a dbspace named tmpdbs, but we could not see it was marked as ‘T’ in the result of onstat -d. We set DBSPACETEMP configuration parameter to tmpdbs. On this condition, tmpdbs will be used for logged temporary tables. That means if a temp table is created with ‘WITH NO LOG’ option, the server will not use it.

This is implying that my tmp2k_logged dbspace (it will not have the ‘T’ flag) cannot be used for unlogged temporary tables. You can see from my table that this isn’t true and I invite you to test this for yourself.

As part proof here is the onstat -d and oncheck -pe output:

$ onstat -d | grep tmp2k
7f9d7310         3        0x60001    3        1        2048     N  BA    informix tmp2k_logged
7f9d74b8         4        0x42001    4        1        2048     N TBA    informix tmp2k_unlogged


DBspace Usage Report: tmp2k_logged        Owner: informix  Created: 03/08/2015


 Chunk Pathname                             Pagesize(k)  Size(p)  Used(p)  Free(p)
     3 /informix_data/sandbox/tmp2k                   2      512       61      451

 Description                                                   Offset(p)  Size(p)
 ------------------------------------------------------------- -------- --------
 RESERVED PAGES                                                       0        2
 CHUNK FREELIST PAGE                                                  2        1
 tmp2k_logged:'informix'.TBLSpace                                     3       50
 db1:'thompsonb'.unlogged                                            53        8
 FREE                                                                61      451

 Total Used:       61
 Total Free:      451

Moving on, let’s do a slightly different test:

select * from sysmaster:systables into temp mytab_unlogged with no log;

and:

select * from sysmaster:systables into temp mytab_logged;

And the results:

DBSPACETEMP setting	unlogged table		logged table
DBSPACETEMP setting	dbspace used	logged operations	dbspace used	logged operations
tmp2k_unlogged:tmp2k_logged	tmp2k_unlogged	no	tmp2k_logged	yes
tmp2k_unlogged	tmp2k_unlogged	no	datadbs	yes
tmp2k_logged	tmp2k_logged	no	tmp2k_logged	yes
NONE	datadbs	no	datadbs	yes

Again my results are in disagreement with the IBM technote which says:

If DBSPACETEMP is not specified, the temporary table is placed in either the root dbspace or the dbspace where the database was created. SELECT…INTO TEMP statements place the temporary table in the root dbspace. CREATE TEMP TABLE statements place the temporary table in the dbspace where the database was created.

In both cases in my tests the engine chose datadbs and not my root dbspace.

Let’s move on a bit. How do I know what temporary tables are in use on my system as a whole?

One way is to run something like:

onstat -g sql 0 | grep -A 3 'User-created Temp tables'

This might get you something like this:

User-created Temp tables :
  partnum  tabname            rowsize 
  400003   mysystables2       500
  400002   mysystables        500

Another way is to run oncheck -pe and have a look at what is in your temporary dbspaces. Here you may also see space used by the engine for sorting, marked with SORTTEMP, or temporary tables created implicitly by the engine for query processing. However whatever type of object it is, you will find it impossible to match anything you see to a particular session by this method; it is only possible to match to a user name which would only allow positive identification if there was only a single session per user.

There is another way to match tables to sessions which works for explicitly created temporary tables, for which I don’t want to claim any credit because I cribbed it from the IIUG Software Repository. The script is called find_tmp_tbls and it its present state is broken when used with 11.70 (and hasn’t been tested since version 9.30.FC2 according to its README): at least it does not work with 64-bit Linux, mainly because the syntax for onstat -g dmp seems to have changed slightly. I managed to fix it up, however.

It’s a little complicated to follow but the basic steps are this:

You need to start with a given session and check if it has any temporary tables. (Unfortunately I don’t know a way of working backwards from the temporary table to see which session it belongs to either through onstat or the SMI interface.)
Get the session’s rstcb value, either from onstat -g ses <sid> or from the first column in onstat -u.
Run onstat -g dmp 0x<rstcb> rstcb_t | grep scb. Note that the rstcb value must be prefixed by 0x.This should return an scb value in hex.
Take this value and run onstat -g dmp <scb> scb_t | grep sqscb. Your address must start with 0x again and this is true throughout all the examples. This will return two values; take the one labelled just sqscb.
Feed this value into another dmp command: onstat -g dmp <sqscb> sqscb_t | grep dicttab. This will return another value.
Finally take this and get the partnum(s) of the temporary tables for the session by running: onstat -g dmp <dicttab> "ddtab_t,LL(ddt_next)" | grep partnum.

Here is all that as an example:

$ onstat -g dmp 0x2aaaab608488 rstcb_t | grep scb
    scb          = 0x2aaaabf5f1c0
$ onstat -g dmp 0x2aaaabf5f1c0 scb_t | grep sqscb
    sqscb        = 0x2aaaad6c8028
    sqscb_poolp  = 0x2aaaad6c92c8
$ onstat -g dmp 0x2aaaad6c8028 sqscb_t | grep dicttab
    dicttab      = 0x2aaaad87c4f0
$ onstat -g dmp 0x2aaaad87c4f0 "ddtab_t,LL(ddt_next)" | grep partnum
    ddt_partnum  = 4194307
    ddt_partnum  = 4194306

It’s worth emphasising that this method will only work for explicitly created temporary tables. It won’t identify temporary space used by:

implicitly created temporary tables created by the engine to process a query.
temporary sort segments.

If there is a similar method for these types, I would be interested to find out about it.

Armed with the partnum you can do whatever you want with it like run this query against the sysmaster database to see what space is being used:

SELECT
  tab.owner,
  tab.tabname,
  dbsp.name dbspace,
  te_chunk chunk_no,
  te_offset offset,
  te_size size
FROM
  systabnames tab,
  systabextents ext,
  syschunks ch,
  sysdbspaces dbsp
WHERE
  tab.partnum in (4194306, 4194307) AND
  ext.te_partnum=tab.partnum AND
  ch.chknum=ext.te_chunk AND
  dbsp.dbsnum=ch.dbsnum
ORDER BY
  tab.owner,
  tab.tabname,
  te_extnum;

Giving results like:

owner	tabname	dbspace	chunk_no	offset	size
thompsonb	mysystables	tmp2k_unlogged	6	53	8
thompsonb	mysystables2	tmp2k_unlogged	6	61	8

For reference there is information about explicit and implicit temporary tables and temporary sort space in these tables in the sysmaster database:

sysptnhdr
sysptnext
sysptnbit
sysptntab
sysptprof
systabextents
systabinfo
systabnames
systabpagtypes

So in conclusion I hope this post brings together some useful information about explicit temporary tables. Personally I’d like to be able to get a complete picture of which sessions what statements are using temporary space, which this doesn’t give. If I find anything it will be subject of a future blog post.

Random postings about IBM Informix

Introduction

Do I need one?

What are the benefits?

What other symptoms are there of spin waits caused by index root node contention?

Is the implementation as good as it could be?

Summary

Introduction

Why Grafana?

What to monitor

Implementation

Practical demonstration with Docker containers

Making this ready for production

Conclusion

Introduction

Statistics and distributions

Auto Update Statistics (AUS) basics

Instance parameters affecting UPDATE STATISTICS

AUS specific parameters

Non-indexed columns and distributions

Columns with incrementing values

Wildcard queries and insufficient distribution resolution

How frequently should I run UPDATE STATISTICS?

Monitoring AUS

UPDATE STATISTICS FOR PROCEDURE

Method for running UPDATE STATISTICS [LOW]

Performance of UPDATE STATISTICS [LOW]

Performance of UPDATE STATISTICS MEDIUM or HIGH

Adding additional AUS Refresh tasks for greater throughput

Pros and cons of using PSORT_DBTEMP

Recent UPDATE STATISTICS defects

UPDATE STATISTICS performance summary

Recent Posts

Categories

Archives

Other Informix blogs

Other links

Meta