Experience with Auto Update Statistics (AUS)
Posted: 30 April, 2015 | Author: Ben Thompson | Filed under: Instance and operating system tuning, Monitoring, SQL query tuning | Tags: AUS, Auto Update Statistics, AUTO_STAT_MODE, DBUPSPACE, IC91678, PSORT_DBTEMP, PSORT_NPROCS, STATCHANGE, USTLOW_SAMPLE

Introduction
This is based on my previous blog post, Working with auto update statistics, which I’ve expanded and significantly improved. I presented this material at the Informix International User Group Conference 2015.
Let’s start at the beginning. Why do we run UPDATE STATISTICS at all? When we write an SQL query and send it to the database engine to execute, there may be several ways that the engine can run the query. For example, if there are two tables the engine can start with the first table and join the second to it, or begin with the second and join the first. There may also be two filter conditions, one of which may be very specific and pick out only a few rows while the other may be very general. It should be apparent that some ways are more efficient than others, sometimes by several orders of magnitude.
Informix uses a cost-based optimizer to determine how to run queries. This relies on metadata to provide information about how large tables are, how many distinct values there are and other information about your data. We call these pieces of information statistics and if we also have a histogram of a column showing the abundance of specific values or ranges we call this a distribution.
The optimizer looks at possible query plans and chooses the one with the lowest costs assigned to it, according to the statistics and distributions. This may or may not be the best plan possible; if not you may see poor performance. Inaccurate cost predictions could be because your statistics are inadequate or out of date.
Maintaining the statistics and distributions is a key DBA responsibility.
What statistics and distributions should you maintain? The purpose is to ensure the optimizer selects the most efficient query plan for your queries.
These query plans should be stable over time. Normally stability is achieved through not changing things: this is why your Change Manager likes to say no (sometimes). However, with UPDATE STATISTICS stability comes from regularly refreshing your statistics and distributions: this is a change to your system you may be doing daily.
What statistics do you need? The Informix Performance Guide offers a general set of recommendations. However:
- They may be more than you need and therefore more to maintain, which could be a headache with a large system.
- In some specific cases they may not be good enough. The result of this can be an application codebase full of query directives (instructions to the optimizer to run queries in a particular way).
- The guide doesn’t say much about how frequently you should run UPDATE STATISTICS.
Statistics and distributions
Statistics | Distributions |
---|---|
Table-level metadata such as the number of rows and the minimum and maximum values per column, stored in systables and syscolumns. | Histograms of a column showing the abundance of specific values or ranges, stored in sysdistrib. |
Auto Update Statistics (AUS) basics
Auto Update Statistics (AUS) consists of two database scheduler jobs. These are stored in the sysadmin database in table ph_task with configuration settings in ph_threshold.
- Auto Update Statistics Evaluation
- This calls a set of stored procedures which populate the sysadmin:aus_command table with a prioritized list of UPDATE STATISTICS commands to run. The code for these procedures is in
$INFORMIXDIR/etc/sysadmin/sch_aus.sql
- Auto Update Statistics Refresh
- This is a UDR that does the work of calling the UPDATE STATISTICS commands. In older versions this was done with SPL code. Auto Update Statistics Refresh cannot be run, even manually, unless the database scheduler is running.
If your instance has a stop file ($INFORMIXDIR/etc/sysadmin/stop) to prevent the scheduler initialising with the instance, you must remove it. Before you do so it’s a good idea to review which jobs are enabled (tk_enable='t') in the ph_task table. (Using a stop file is a bad idea with 12.10 because running without the scheduler stops some system processes from functioning, so if you’re doing this you ought to change it even if you don’t want to use AUS.)
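Before removing a stop file it can help to see what will start once the scheduler comes up. A minimal sketch, querying the documented ph_task columns:

```sql
-- List database scheduler jobs and whether each is enabled.
DATABASE sysadmin;

SELECT tk_name, tk_type, tk_enable
FROM ph_task
ORDER BY tk_name;
```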
One advantage of AUS is that it works on all of your databases, including sysmaster.
Instance parameters affecting UPDATE STATISTICS
Three onconfig parameters affect statistics maintenance, independently of the tool you use to update statistics:
Parameter | Description |
---|---|
AUTO_STAT_MODE | Controls whether UPDATE STATISTICS refreshes only those statistics or distributions the engine considers stale. |
STATCHANGE | The percentage change beyond which statistics are considered stale when AUTO_STAT_MODE is enabled. This can be overridden at table level. |
USTLOW_SAMPLE | Whether to use sampling for UPDATE STATISTICS LOW. |
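If you want to confirm the values actually in effect (for example after a dynamic change with onmode -wm), they can be read from the sysmaster pseudo-tables; a sketch using the documented sysconfig columns:

```sql
-- Effective onconfig values relevant to UPDATE STATISTICS.
SELECT cf_name, cf_original, cf_effective
FROM sysmaster:sysconfig
WHERE cf_name IN ('AUTO_STAT_MODE', 'STATCHANGE', 'USTLOW_SAMPLE');
```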
AUS specific parameters
AUS’s configuration settings stored in ph_threshold work independently of these system level settings.
These can be updated via SQL without using OpenAdmin Tool:
```sql
UPDATE
    sysadmin:ph_threshold
SET
    value = '0'
WHERE
    name = 'AUS_CHANGE';
```
Parameter | Description |
---|---|
AUS_AGE | The number of days after gathering statistics that they are considered stale. |
AUS_CHANGE | Prioritises tables based on the percentage of rows that have changed. This is not to be confused with STATCHANGE. |
AUS_PDQ | The PDQPRIORITY to use while updating statistics. If you are using Workgroup Edition, set this to 0 to avoid annoying messages in the online log about not being allowed to use PDQ. |
AUS_SMALL_TABLES | The small table row threshold. Tables below this size are considered volatile and are updated on every run. |
AUS_RULES | Controls which statistics and distributions AUS creates and maintains. For most systems ‘1’ is the value you should use. If in doubt, use this. |
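All of these thresholds can be reviewed in one go with a query similar to the AUS_CHANGE example above:

```sql
-- Current AUS settings and their values.
SELECT name, value
FROM sysadmin:ph_threshold
WHERE name LIKE 'AUS%'
ORDER BY name;
```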
You’ll notice in the description for AUS_RULES=1 that AUS does not automatically generate any distributions on non-indexed columns. However, it will maintain any existing distributions regardless of the value of AUS_RULES, even ones AUS would not have created itself.
Non-indexed columns and distributions
If you are migrating from dostats you will have MEDIUM mode distributions on all non-indexed columns. It’s possible to drop these on individual columns using:
UPDATE STATISTICS LOW FOR TABLE <tab name> (<colname>) DROP DISTRIBUTIONS ONLY;
I’d test before doing anything like this, so you may just choose to leave them in and let AUS maintain them. If UPDATE STATISTICS MEDIUM doesn’t take very long on your system this is probably the best choice.
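To see which columns would be affected before dropping anything, the existing MEDIUM mode distributions can be listed from the system catalog. A sketch, assuming the usual sysdistrib mode code of 'M' for medium (verify against your version):

```sql
-- Columns in the current database with MEDIUM mode distributions.
SELECT DISTINCT t.tabname, c.colname, d.resolution, d.confidence
FROM sysdistrib d
     JOIN systables t ON t.tabid = d.tabid
     JOIN syscolumns c ON c.tabid = d.tabid AND c.colno = d.colno
WHERE d.mode = 'M'
  AND t.tabid > 99   -- skip the system catalog tables
ORDER BY t.tabname, c.colname;
```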
Are these distributions important? My answer is that it depends but, more often than not, no. The Informix Performance Guide recommends you have them but this is just a general recommendation. At this point it’s important not to lose sight of the fact that the goal is not to have the most comprehensive, high-resolution and up to date statistics possible; it is to ensure you can guarantee that your queries always run efficiently. Given that we can’t be updating statistics and distributions on all columns all of the time some compromises are inevitable.
Often when running a query the main risk is the optimizer choosing a sequential scan instead of an index. This risk is greatly reduced, if not eliminated, if the onconfig parameter OPTCOMPIND is set to 0. The downside is that the optimizer won’t select a sequential scan when it is the best query plan available unless there are no other options.
In general distributions on a column are more useful if there is some skew in the data. However be aware that for non-indexed columns syscolumns.colmin and syscolumns.colmax are never populated by UPDATE STATISTICS LOW so the optimizer is truly blind about the data ranges without a distribution.
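You can verify this yourself after running UPDATE STATISTICS LOW: for non-indexed columns the colmin and colmax values stay unpopulated. A sketch (note that for character columns the values are stored in an encoded integer form):

```sql
-- Inspect the min/max values recorded for each column of the people table.
SELECT c.colname, c.colmin, c.colmax
FROM syscolumns c
     JOIN systables t ON t.tabid = c.tabid
WHERE t.tabname = 'people';
```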
I’m going to run through an example now using this table:
people table | | | indices | |
---|---|---|---|---|
person_id | serial | not null | ix_people1 | (person_id) |
name | varchar(254) | not null | ix_people2 | (age) |
gender | char(1) | not null | | |
age | int | not null | | |
The table will be populated with random ersatz data as follows:
- 1 million rows.
- Ages evenly distributed between 5 and 84.
- 75% female, 25% male.
I’ll be running this query:
```sql
SELECT
    name
FROM
    people
WHERE
    age BETWEEN 18 AND ? AND
    gender = ?;
```
And tweaking these parameters:
- Upper age limit in the query.
- Gender in the query.
- Whether I have a medium mode distribution on gender.
- The value of OPTCOMPIND in the onconfig.
My results were as follows:
Upper age | Gender | Actual rows returned | Medium mode distribution on gender? | OPTCOMPIND | Estimated rows | Estimated cost | Plan |
---|---|---|---|---|---|---|---|
25 | f | 74817 | No | 0 | 9988 | 21104 | Index path |
25 | f | 74817 | No | 2 | 9988 | 21104 | Index path |
25 | f | 74817 | Yes | 0 | 75539 | 21104 | Index path |
25 | f | 74817 | Yes | 2 | 75539 | 21104 | Index path |
25 | m | 25061 | No | 0 | 9988 | 21104 | Index path |
25 | m | 25061 | No | 2 | 9988 | 21104 | Index path |
25 | m | 25061 | Yes | 0 | 24539 | 21104 | Index path |
25 | m | 25061 | Yes | 2 | 24539 | 21104 | Index path |
30 | f | 121748 | No | 0 | 16232 | 38477 | Index path |
30 | f | 121748 | No | 2 | 16232 | 33923 | Sequential scan |
30 | f | 121748 | Yes | 0 | 122439 | 38477 | Index path |
30 | f | 121748 | Yes | 2 | 122439 | 33923 | Sequential scan |
30 | m | 40572 | No | 0 | 16232 | 38477 | Index path |
30 | m | 40572 | No | 2 | 16232 | 33923 | Sequential scan |
30 | m | 40572 | Yes | 0 | 39881 | 38477 | Index path |
30 | m | 40572 | Yes | 2 | 39881 | 33923 | Sequential scan |
What conclusions can we draw from this example?
- OPTCOMPIND was a determining factor, not the presence of a medium mode distribution on gender.
- Having the distribution gave a much better estimate of the number of rows returned.
- The optimizer never used a different plan for the male or female queries.
Of course this is one example and you may have some different ones.
Columns with incrementing values
Let’s illustrate a different point with another example.
Maintaining distributions on any sort of log or journal where there is a timestamp field can be a problem. The highest value in your distribution is wrong almost immediately after calculating it because new rows are being added all the time with later timestamps. This means that if you do a query over recently added data your distributions may tell the optimizer that there’s no data.
ledger table | | | indices | |
---|---|---|---|---|
line_id | serial | not null | ix_ledger1 | (line_id) |
account_id | int | not null | ix_ledger2 | (account_id, creation_time) |
creation_time | datetime year to second | not null | ix_ledger3 | (creation_time) |
 | | | ix_ledger4 | (account_id) |
for_processing table | | | indices | |
---|---|---|---|---|
line_id | int | not null | ix_fp1 | (line_id) |
creation_time | datetime year to second | not null | ix_fp2 | (creation_time) |
Both of these tables have over 1 million rows, with new rows being added continuously.
I am going to run this query:
```sql
SELECT
    FIRST 5
    l.line_id,
    l.creation_time
FROM
    ledger l,
    for_processing f
WHERE
    f.line_id = l.line_id AND
    l.account_id = 50 AND
    f.creation_time BETWEEN
        '2015-02-02 17:00:00' AND '2015-02-02 21:00:00';
```
There are two conceivable ways to run this query:
- Use the index on creation_time on for_processing, join to the ledger table on line_id and then filter on the account_id column.
- Use the index on account_id on the ledger table, join by line_id and then filter on the creation_time column.
The risk with the first one is that a lot of rows are read, only to eliminate the vast majority of them when the account_id criterion is applied.
The optimizer may prefer to drive off creation_time, particularly if the distribution indicates there are no data past the time the distribution was gathered. This is because it believes (wrongly) that selecting from a date range in the for_processing table where it thinks there is no data at all, avoiding the need to do any work, is more efficient than selecting an account from the ledger table and then joining to for_processing.
This can be very slow for large date ranges. This is particularly true when there are a large number of accounts.
Can AUS help you here? Not really: this example is more to point out a danger. The risk of running into a problem like this is massively increased if you use the default STATCHANGE value of 10, because here it is the age of the distributions that matters, not how much the data has changed.
My recommendation is, in your onconfig, either:

- set AUTO_STAT_MODE to 0, or
- set STATCHANGE to 0 if AUTO_STAT_MODE is 1.

If there are tables for which this is not appropriate, override it at the table level:

ALTER TABLE <tabname> STATCHANGE <percent>;

In my view allowing a refresh to be skipped only when there have been absolutely no changes is the only safe way.
There is an undocumented feature that can help here:
IC91678: Informix optimizer vulnerable to poor query response on incrementing columns.
However the fix is not switched on unless you set the onconfig parameter SQL_DEF_CTRL to 0x2. This can be switched on dynamically with onmode -wm. With it enabled, date/time distributions are effectively extended into the future by up to 30 days. While the answer to the question of how old your statistics and distributions can be is nearly always “it depends”, with this switched on there is a hard limit.
The date or datetime column concerned must have a default value of TODAY or CURRENT. The code also compares the last timestamp in the distribution with the time the distribution was gathered. The two must be close together to activate the feature.
This fix also works on serial, serial8 and bigserial fields.
This feature is a little tricky to test because you must:
- Populate the tables with a large volume of data.
- Update the distributions.
- Add some more data with later timestamps.
- Wait a while!
Here are my results:
Date range | Within the distribution bounds | Beyond the distribution upper bound | Beyond the distribution upper bound with IC91678 |
---|---|---|---|
Query drives off | account_id | creation_time | account_id |
Costs: lead with account_id | 264 | 398 | 398 |
Costs: lead with creation_time | 79004 | 4 | 398 |
Estimated rows | 20 | 1 | 1 |
Actual rows | 10 | 1 | 1 |
Wildcard queries and insufficient distribution resolution
Another problem you may have with the standard distributions created by UPDATE STATISTICS HIGH is insufficient resolution. By default the data are divided into 200 equal-sized buckets (resolution 0.5) and this may not suffice for some wildcard queries. The optimizer may be grossly wrong in its estimates for the number of rows returned and this can be improved by increasing the number of buckets.
customer table | | | indices | |
---|---|---|---|---|
customer_id | serial | not null | ix_customer1 | (customer_id) |
user_name | varchar(50) | not null | ix_customer2 | (user_name) |
Again this table will be populated with random ersatz data as follows:
- 9 million rows.
- This gives a distribution bucket size of 45000 with 0.5 resolution.
And my query:
```sql
SELECT
    FIRST 1000
    customer_id,
    user_name
FROM
    customer
WHERE
    user_name LIKE 'BAN%'
ORDER BY
    customer_id;
```
Look at the query carefully. It’s the sort of query you might get from a web customer search form. The ORDER BY on customer_id is important because it gives the engine the option of avoiding a sort entirely if the index on this column is used to select the data. If the optimizer thinks the user_name criterion is not very selective (i.e. there are many more than 1000 customers whose user name begins with the letters BAN, so it will find 1000 matches quickly without reading many rows), it may prefer this query plan.
There are two credible plans for this query:
- Use the index on user_name, read all the rows it points to and then sort on customer_id, return the first 1000.
- Use the unique index on customer_id, read the rows it points to, filter, stop at 1000 rows. This plan does not require a sort.
There is a third option of scanning the entire table but it’s unlikely the optimizer will choose this.
Let’s look at the results of running this query with a standard HIGH mode distribution on user_name, a second distribution with ten times the resolution, and no distribution at all.
Variable | HIGH 0.5 resolution | HIGH 0.05 resolution | No distributions |
---|---|---|---|
Selected index (no directive) | customer_id | user_name | customer_id |
Costs: lead with customer_id | 3480 | 313249 | 174 |
Costs: lead with user_name | 567525 | 1214 | 12435040 |
Estimated rows | 90000 | 1000 | 1800000 |
Actual rows | 1028 | 1028 | 1028 |
This example is real but also carefully selected because different three letter combinations may give different and more sane results, even with the standard distribution resolution.
If instead I run this query:
```sql
SELECT
    FIRST 1000
    customer_id,
    user_name
FROM
    customer
WHERE
    user_name LIKE 'TDM%'
ORDER BY
    customer_id;
```
I get:
Variable | HIGH 0.5 resolution | HIGH 0.05 resolution | No distributions |
---|---|---|---|
Selected index (no directive) | user_name | user_name | customer_id |
Costs: lead with customer_id | 135723 | 313249 | 174 |
Costs: lead with user_name | 2909 | 1214 | 12435040 |
Estimated rows | 2308 | 1000 | 1800000 |
Actual rows | 801 | 801 | 801 |
AUS makes maintaining custom resolutions (or confidence levels with UPDATE STATISTICS MEDIUM) very easy. Simply run UPDATE STATISTICS manually to gather the desired distribution. AUS will maintain this resolution for you.
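For example, the 0.05 resolution distribution used in the tests above could be gathered once with:

```sql
-- One-off manual run; AUS will rebuild this distribution
-- at the same 0.05 resolution (2000 buckets) from now on.
UPDATE STATISTICS HIGH FOR TABLE customer (user_name) RESOLUTION 0.05;
```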
How frequently should I run UPDATE STATISTICS?
A common question a DBA may ask is: how often is it necessary to run UPDATE STATISTICS? A straightforward, if unhelpful, answer is: often enough to ensure efficient query plans. More specifically:
- On tables without columns with incrementing values, by which I mean that new values lie inside the range of the existing data, or where these columns are not used in the selection criteria for queries, it may be safe to use STATCHANGE (with AUTO_STAT_MODE) to regulate the frequency based on percentage change at the table level.
- With incrementing values the working set can quickly get beyond the min/max values in the statistics or distributions. Here I’d recommend being much more aggressive and update based on age, regardless of the volume of change. This especially applies to your distributions.
If your system is small enough that you can run a full UPDATE STATISTICS every day there is no harm in doing this. It is probably overkill but it is one way of playing safe. To do this set AUS_CHANGE to 0 and make sure both the scheduler and the evaluator run daily. For larger systems you do need to be more selective about how often you run UPDATE STATISTICS.
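To spot tables whose distributions are getting old, the construction time stored in sysdistrib can be queried. A sketch, assuming the constr_time column present in recent versions and a 30-day threshold you should adjust to your own policy:

```sql
-- Tables in the current database whose distributions
-- were gathered more than 30 days ago.
SELECT DISTINCT t.tabname, d.constr_time
FROM sysdistrib d
     JOIN systables t ON t.tabid = d.tabid
WHERE t.tabid > 99
  AND d.constr_time < CURRENT YEAR TO SECOND - 30 UNITS DAY
ORDER BY d.constr_time;
```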
Monitoring AUS
Does AUS just work? Well yes, it was aimed at the embedded market and it is pretty robust. On a larger system there is more to go wrong and so I’d recommend you check that:
- the parameters are set correctly (in case they change if you rebuild the sysadmin database, for example).
- there are no errors running UPDATE STATISTICS.
- all statistics and distributions are sufficiently up to date.
- the window you give the refresh job to run in is long enough.
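The error check can be done directly against the aus_command table. A sketch assuming the state and error columns created by sch_aus.sql ('E' indicating an error; verify the codes in your version's script):

```sql
-- UPDATE STATISTICS commands that failed on the last AUS run.
SELECT aus_cmd_exe, aus_cmd_err_sql, aus_cmd_err_isam
FROM sysadmin:aus_command
WHERE aus_cmd_state = 'E';
```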
UPDATE STATISTICS FOR PROCEDURE
How does AUS deal with stored procedure plans, as stored in sysprocplan? Well, it doesn’t directly, and it does not call UPDATE STATISTICS FOR PROCEDURE [procname].
My take on this is that routines referencing updated tables will be recompiled anyway the first time they are called after running UPDATE STATISTICS. On a busy OLTP system this will probably happen before you have a chance to update procedure plans manually. If you do have reason to run it, you will need to do so manually, and don’t set PDQPRIORITY when you do.
If your system does have dead time there may be a small advantage to running this but I don’t think it really matters that much.
Method for running UPDATE STATISTICS [LOW]
Let’s now look at how AUS actually calls UPDATE STATISTICS. As discussed earlier the evaluation task creates a list of commands to run and these are run by the refresh task exactly as you see them when you query the aus_command table.
Let’s start by considering UPDATE STATISTICS [LOW].
AUS simply calls UPDATE STATISTICS [LOW] as one command without specifying any column names, and I have seen it suggested that this is a bad thing (I don’t agree). The popular alternative, dostats, runs UPDATE STATISTICS LOW separately for each index key. Performance-wise there is not as much difference as you might expect; I suspect this is because the data tend to be in the buffer cache on repeated calls. But is there any difference in the end result?
Using Informix 11.70.FC7W1 I performed the following test:
- Created two identical empty tables with twelve indices and loaded both with the same data.
- Ran AUS on one, dostats on the other.
- Unloaded and compared systables, syscolumns and sysindices for both tables.
Apart from tabid, tabname, index names and build times, the unloads were identical. My conclusion is that there is no difference, and AUS is slightly faster because it does the work with one command in a single pass.
Performance of UPDATE STATISTICS [LOW]
In terms of performance there is actually very little you can do to influence the speed of UPDATE STATISTICS LOW. The parameters DBUPSPACE, PSORT_NPROCS and even PDQPRIORITY have no effect.
Setting USTLOW_SAMPLE in the onconfig file is the only performance optimisation available apart from general server tuning.
It is supposed to be possible to get parallel scan threads for all fragmented indices when PDQPRIORITY is set to at least 10, but I can’t reproduce this. This only works when USTLOW_SAMPLE is switched off.
Performance of UPDATE STATISTICS MEDIUM or HIGH
For UPDATE STATISTICS HIGH and MEDIUM there are a few parameters that can influence performance and with the exception of PDQPRIORITY you can’t use any of them with AUS. In summary they are:
- DBUPSPACE: three parameters in one, controlling four things:
  - how much disk space is available for sorting when PDQPRIORITY is not set.
  - how much memory is available for sorting when PDQPRIORITY is not set.
  - whether to use indices for sorting.
  - whether to print the explain plan.
- DBSPACETEMP: overrides default onconfig value.
- PSORT_DBTEMP: allows you to specify a filesystem instead of DBSPACETEMP.
- PSORT_NPROCS: specifies number of threads for sorting.
Having done tests with these parameters:
- The disk and memory parameters don’t have any effect if you set PDQPRIORITY. If you do set it, up to DS_TOTAL_MEMORY multiplied by the effective PDQPRIORITY (as a percentage) can be allocated.
- Even when set, they only have the effect of restricting the number of columns that can be processed in a single pass, so a lot of the time they make no difference. For a better understanding of how this works see John Miller III’s article on Understanding and Tuning Update Statistics.
- Setting the explain output via the directive works a bit weirdly: you need to run SET EXPLAIN ON separately to enable the explain plan even if it is set here, but disabling it with this directive does work.
- The default is to use indices for sorting. Switching this off (directive 1 or 2) was up to 5x slower on my test system.
Here are some performance tests showing the effects of different DBUPSPACE settings on UPDATE STATISTICS HIGH. My performance tests are all carried out on a powerful 32-core Linux x86_64 server with 256 GB RAM and fast SAN storage (not my laptop). For my tests I did a dummy run first to try to even out any effects of caching in the SAN. I then did a run immediately after initialising my instance (cold) and a repeat run (warm) for each result.
10 million row non-partitioned table with 12 single-column indices:
DBUPSPACE | PDQPRIORITY | Memory allocated (Mb) | No. of scans | Indices used for sorting? | Light scans used? | Time elapsed cold (s) | Time elapsed warm (s) |
---|---|---|---|---|---|---|---|
1024:15:0 (default) | 0 | 15 | 12 | Yes | No | 111 | 97 |
0:50:0 | 0 | 50 | 12 | Yes | No | 108 | 97 |
0:50:1 | 0 | 50 | 12 | No | No | 259 | 254 |
0:50:0 | 100 | 2462.7 | 1 | Yes | Yes | 172 | 180 |
0:50:1 | 100 | 2462.7 | 1 | No | Yes | 176 | 172 |
74 million row table with 8-way round-robin partitioning and 3 single-column indices:
DBUPSPACE | PDQPRIORITY | Memory allocated (Mb) | No. of scans | Indices used for sorting? | Light scans used? | Time elapsed cold (s) | Time elapsed warm (s) |
---|---|---|---|---|---|---|---|
1024:15:0 (default) | 0 | 15 | 3 | Yes | No | 227 | 169 |
0:50:0 | 0 | 50 | 3 | Yes | No | 224 | 168 |
0:50:1 | 0 | 50 | 3 | No | No | 501 | 494 |
0:50:0 | 100 | 3064.7 | 1 | Yes | Yes | 432 | 426 |
0:50:1 | 100 | 3064.7 | 1 | No | Yes | 425 | 428 |
What conclusions can we draw from these results?
- Using PDQPRIORITY actually makes it slower.
- There is no significant difference between using 15 Mb of memory for sorts and 50 Mb of memory for sorts. I suspect this is because the number of scans is the same.
- Using indices for sorting (the default) is faster than not using indices for sorting when not using PDQ.
- The use of light scans (avoiding the buffer cache) reduces the variation between the cold and warm results.
The interesting thing here is that despite having a partitioned table and PDQPRIORITY set, I observed no parallelism during execution. You can see this for yourself by identifying the session running UPDATE STATISTICS and looking at the threads the session is using. I get something like this for my partitioned table:
```
IBM Informix Dynamic Server Version 11.70.FC7W1 -- On-Line -- Up 00:01:11 -- 228250944 Kbytes

session           effective                            #RSAM    total      used       dynamic
id       user     user      tty     pid      hostname  threads  memory     memory     explain
14       informix -         -       28798    guebobdb  1        11563008   10603832   off

Program :
/opt/informix-11.70.FC7W1/bin/dbaccess

tid      name     rstcb            flags    curstk   status
201      sqlexec  26e1fbbfc8       ---PR--  24752    running-

Memory pools    count 4
name         class addr           totalsize  freesize   #allocfrag #freefrag
14           V     26e4c59040     139264     10008      176        12
14_SORT_0    V     26e573d040     4108288    294896     34865      4
14_SORT_1    V     26e5748040     3657728    326176     32934      4
14_SORT_2    V     26e5730040     3657728    325936     32937      4

name           free       used           name           free       used
overhead       0          13152          scb            0          144
opentable      0          10640          filetable      0          3016
ru             0          600            misc           0          160
log            0          16536          temprec        0          21664
blob           0          832            keys           0          664
ralloc         0          19320          gentcb         0          1592
ostcb          0          2944           sort           0          1577072
sqscb          0          20224          sql            0          72
srtmembuf      0          817320         hashfiletab    0          552
osenv          0          3000           buft_buffer    0          8736
sqtcb          0          9680           fragman        0          3640
shmblklist     0          8074432

sqscb info
scb      sqscb    optofc   pdqpriority optcompind  directives
e28f91c0 e4c5a028 0        100         0           1

Sess  SQL            Current            Iso Lock       SQL  ISAM F.E.
Id    Stmt type      Database           Lvl Mode       ERR  ERR  Vers  Explain
14    UPDATE STATIST testdb             DR  Wait 15    0    0    9.24  On

Current SQL statement (6) :
  update statistics high for table large_8-way_table
  (id_col1,id_col2,date_col) distributions only force

Last parsed SQL statement :
  update statistics high for table large_8-way_table
  (id_col1,id_col2,date_col) distributions only force
```
In the above example we can see that the engine has allocated three sort memory pools, one for each of the three columns being updated in a single pass with PDQPRIORITY set.
If you do see parallel sort (PSORT) threads like this:
```
Program :
/opt/informix-11.70.FC7W1/bin/dbaccess

tid      name      rstcb            flags    curstk   status
370      sqlexec   26e1fbaf28       ---PR--  20688    running-
371      psortpro  26e1fde468       Y------  912      cond wait  incore
372      psortpro  26e1fcda68       Y------  912      cond wait  incore
373      psortpro  26e1fd88f8       Y------  912      cond wait  incore
374      psortpro  26e1fd1498       Y------  912      cond wait  incore
375      psortpro  26e1fc6608       Y------  912      cond wait  incore
376      psortpro  26e1fdc328       Y------  912      cond wait  incore
377      psortpro  26e1fd4ec8       Y------  912      cond wait  incore
378      psortpro  26e1fca038       Y------  912      cond wait  incore
379      psortpro  26e1fbe958       Y------  912      cond wait  incore
380      psortpro  26e1fdbad8       Y------  912      cond wait  incore
381      psortpro  26e1fcb0d8       Y------  912      cond wait  incore
382      psortpro  26e1fda1e8       Y------  912      cond wait  incore
383      psortpro  26e1fc97e8       Y------  912      cond wait  incore
384      psortpro  26e1fd5718       Y------  912      cond wait  incore
385      psortpro  26e1fcd218       Y------  912      cond wait  incore
386      psortpro  26e1fc2388       Y------  912      cond wait  incore
387      psortpro  26e1fd80a8       Y------  912      cond wait  incore
388      psortpro  26e1fd0c48       Y------  912      cond wait  incore
389      psortpro  26e1fc5db8       -------  1632     running-
390      psortpro  26e1fdf508       Y------  912      cond wait  incore
391      psortpro  26e1fd7858       Y------  912      cond wait  incore
392      psortpro  26e1fc6e58       Y------  912      cond wait  incore
393      psortpro  26e1fd35d8       Y------  912      cond wait  incore
394      psortpro  26e1fdcb78       -------  8        running-
395      psortpro  26e1fca888       -------  1632     running-
396      psortpro  26e1fbf9f8       -------  1632     running-
397      psortpro  26e1fcc9c8       -------  8        running-
398      psortpro  26e1fc1b38       -------  8        running-
399      psortpro  26e1fd1ce8       -------  8        running-

Memory pools    count 4
name         class addr           totalsize  freesize   #allocfrag #freefrag
41           V     270a6cd040     954368     61560      438        53
41_SORT_0    V     2706fc2040     3915776    2184       33415      10
41_SORT_1    V     2706fbb040     4222976    14488      31589      27
41_SORT_2    V     2706fbc040     4222976    14248      31592      27
```
It is either because PSORT_NPROCS is set in your environment or set in the database engine’s environment when it was started.
Let’s now look at the effect of PSORT_NPROCS. The only way to use this with AUS is to set PSORT_NPROCS when you start the database engine, which will of course affect all sessions.
Setting PSORT_NPROCS is the only way to get any parallel processing with UPDATE STATISTICS HIGH or MEDIUM. It has no effect on UPDATE STATISTICS LOW. Setting PDQPRIORITY only provides more memory and allows HIGH and MEDIUM mode distributions on multiple columns to be built in a single pass, if enough memory is available. There will be one SORT memory pool per column being processed regardless of PDQPRIORITY. Sorting with PSORT_NPROCS set can be faster as we’ll see now.
Carrying on with the same examples as above, I get these results:
10 million row non-partitioned table with 12 single-column indices:
DBUPSPACE | PDQPRIORITY | PSORT_ NPROCS | Memory allocated (Mb) | No. of scans | Light scans used? | Time elapsed cold (s) | Time elapsed warm (s) |
---|---|---|---|---|---|---|---|
0:50:0 | 0 | Not set | 50 | 12 | No | 108 | 97 |
0:50:0 | 0 | 24 | 50 | 12 | No | 123 | 97 |
0:50:0 | 100 | Not set | 2462.7 | 1 | Yes | 172 | 180 |
0:50:0 | 100 | 24 | 2538.4 | 1 | Yes | 80 | 83 |
74 million row table with 8-way round-robin partitioning and 3 single-column indices:
DBUPSPACE | PDQPRIORITY | PSORT_ NPROCS | Memory allocated (Mb) | No. of scans | Light scans used? | Time elapsed cold (s) | Time elapsed warm (s) |
---|---|---|---|---|---|---|---|
0:50:0 | 0 | Not set | 50 | 3 | No | 224 | 168 |
0:50:0 | 0 | 24 | 50 | 3 | No | 223 | 170 |
0:50:0 | 100 | Not set | 3064.7 | 1 | Yes | 432 | 426 |
0:50:0 | 100 | 24 | 3083.7 | 1 | Yes | 156 | 145 |
What conclusions can we draw this time?
- The fastest UPDATE STATISTICS HIGH (or MEDIUM) performance is with PDQ and PSORT_NPROCS set. (Seasoned DBAs might have expected this result).
- But it’s not much faster than running without either of these parameters set, probably with fewer server resources.
- It’s worth bearing in mind that PDQ enables light scans which may avoid buffer cache churn.
Adding additional AUS Refresh tasks for greater throughput
There is another way to achieve parallelism with AUS and that is to add additional scheduler tasks so that more than one table can be worked on at once, for example:
```sql
INSERT INTO
    sysadmin:ph_task (
        tk_name,
        tk_description,
        tk_type,
        tk_dbs,
        tk_execute,
        tk_delete,
        tk_start_time,
        tk_stop_time,
        tk_frequency,
        tk_monday,
        tk_tuesday,
        tk_wednesday,
        tk_thursday,
        tk_friday,
        tk_saturday,
        tk_sunday,
        tk_group,
        tk_enable
    )
VALUES (
    'Auto Update Statistics Refresh #2',
    'Refreshes the statistics and distributions which were recommended by the evaluator.',
    'TASK',
    'sysadmin',
    'aus_refresh_stats',
    '0 01:00:00',
    '03:11:00',
    '12:45:00',
    '1 00:00:00',
    't',
    't',
    't',
    't',
    't',
    'f',
    'f',
    'PERFORMANCE',
    't'
);
```
It is vital that the task name begins with ‘Auto Update Statistics Refresh’, as shown here; otherwise some of the internal code that stops the evaluator from running at the same time as the refresh tasks will not work.
Think of it as a batch process where the goal is to update all your statistics and distributions, not doing an individual table as fast as possible.
If you need extra grunt, this is the method I recommend; run it without PDQ (AUS_PDQ = 0).
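Having added a second task, it’s worth checking that all the refresh tasks are registered and enabled; a query along these lines does the job:

```sql
-- List the AUS refresh tasks known to the scheduler.
SELECT tk_id, tk_name, tk_start_time, tk_stop_time, tk_enable
FROM sysadmin:ph_task
WHERE tk_name LIKE 'Auto Update Statistics Refresh%';
```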
Pros and cons of using PSORT_DBTEMP
Let’s move on to consider another environment variable you can’t easily use with AUS: PSORT_DBTEMP. Is using a filesystem for sorts faster than temporary dbspaces? I am not going to do any performance tests for this, largely because a comparison between the local SATA disks used for filesystems and the battery-backed, part solid-state SAN used for the database storage on my server would not be a fair fight.
If you want to use PSORT_DBTEMP with AUS, again you will need to set it in your environment when initialising the server and use it across your system.
The only definitive benefit of pointing PSORT_DBTEMP at a filesystem instead of using DBSPACETEMP is that the Linux filesystem cache comes into play: your temporary sort files may never be committed to disk at all, which can give a performance advantage. Another interesting alternative is to use a RAM disk.
Otherwise when considering whether to use a filesystem over temporary dbspaces, I would consider your hardware.
Recent UPDATE STATISTICS defects
Below are some defects relating to UPDATE STATISTICS that I have personally logged with IBM. All are general UPDATE STATISTICS issues and whether you use AUS or not does not affect your chances of hitting these. I have written some comments underneath.
- IT06767 UPDATE STATISTICS HIGH ON TABLE BIGGER THAN 1.1 BIO ROWS CAN CAUSE EXCEPTION IN SRTQUICK
  - This defect can only be hit if you use PDQ and a large amount of memory is allocated.
- IT06726 Assert Warning Invalid index statistics found when using statistics sampling with index having many deleted items
  - Requires USTLOW_SAMPLE to be switched on. It is fairly harmless but does write some garbage to sysfragments. Re-running UPDATE STATISTICS LOW usually fixes it.
- IT02679 UPDATE STATISTICS HIGH ON A FRAGMENTED TABLE LEADS TO ERROR MESSAGE 9810
  - This can only be seen if using fragment level statistics.
- IT05463 Fragment based update stats high consumes a huge amount of sblobspace
  - This can only be seen if using fragment level statistics. It is not worth trying to make your sbspace extremely large to work around this problem.
- IT05639 -768: Internal error in routine mbmerge2bin when running update statistics on a fragmented table (running fragment-level statistics)
  - Another fragment level statistics one. With the fixes for IT02679, IT05463 and IT05639 this feature is solid. I wouldn’t enable fragment level statistics for a table unless you have all of these fixes in your version.
Most of these defects are fixed in 12.10.FC5.
UPDATE STATISTICS performance summary
Overall, then, what seems to be the best balance between speed and use of resources on your system is:
- Set DBUPSPACE=0:50:0 in the server environment.
- In terms of AUS, leave AUS_PDQ at 0.
- If you need more throughput, add more refresh tasks to the scheduler.
That’s it.
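The AUS_PDQ setting itself lives in the ph_threshold table in the sysadmin database, so the recommendation above can be applied with a simple update:

```sql
-- Keep the AUS refresh tasks running without PDQ.
UPDATE sysadmin:ph_threshold
SET value = '0'
WHERE name = 'AUS_PDQ';
```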
I hope this has been useful. I’ll end with a quick summary of some of the more interesting performance points from this presentation which apply regardless of whether you use AUS or not:
- Setting PDQPRIORITY only provides more memory for UPDATE STATISTICS HIGH or MEDIUM and does not provide parallel processing. It may even make it run slower.
- If you want multi-threaded processing, the only way to get it is to set PSORT_NPROCS, and then only with UPDATE STATISTICS HIGH or MEDIUM. Because it does not require PDQ, you can use this with Workgroup Edition.
- In my tests, PDQPRIORITY and PSORT_NPROCS only improve performance when used together. PDQ does switch on light scans though, which avoid churning your buffer cache.
- Not using indices for sorting for UPDATE STATISTICS HIGH or MEDIUM can be significantly slower.
- The performance of UPDATE STATISTICS [LOW] can be improved by setting USTLOW_SAMPLE.
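USTLOW_SAMPLE is an onconfig parameter and can be switched on dynamically through the SQL administration API (the 'wm' option changes the in-memory value only; use 'wf' to also write it to the onconfig file):

```sql
-- Enable sampling for UPDATE STATISTICS LOW without a restart.
EXECUTE FUNCTION sysadmin:task('onmode', 'wm', 'USTLOW_SAMPLE=1');
```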