For a while now, I have been looking to get beyond tuning the database instances I look after at the instance level by doing activities such as monitoring I/O queues and cache hit rates and aiming to eliminate bottlenecks by spreading I/O evenly around a set of disks. It’s not that these things aren’t important; it’s just that they’re only one half of the picture and perhaps the smaller half: application performance and design are at least as important. For this I need to start looking at what SQL statements are running on the instance, how often and whether they efficient.
Query plan stability is another area I am interested in. If a SQL query plan changes, particularly if it changes for the worse and general performance suffers, how can I detect this? How can I be sure it was “better” before?
Two key questions I therefore want to answer are:
- What are the “top” SQL statements being run?
- What are the query plans being used by the engine for these statements and have these plans changed?
To do this I need to take samples of database activity and store them.
The definition of a “top” SQL statement is debatable but in an Informix context a useful definition could be the statements that spend the most time with active threads. This should capture quick frequently-run statements as well as report queries. We can then get more sophisticated and look at what these threads are doing and what – if anything – they’re waiting on.
I looked around at some of the various tools available in the market place and they all seemed to be able to look at tuning at the instance level, something I think DBAs are well-catered for with Informix, but few can do what I want. For example, I use Sentinel from AGS to draw real-time graphs showing thread, CPU, read and write activity but it doesn’t tell me much about what is causing this activity. Some tools can use SQL Trace to record all statements run in a session but the application I look after has literally hundreds of database connections so this is not very useful. Other vendors implement diagnostics into their application but often such data isn’t easily accessible to a DBA.
What I am looking for looks a little like Oracle AWR or Statspack. These are vendor-supplied solutions with close integration into Oracle. So could I come with a DBA home-brew solution for Informix to give me what I want? It appears the answer is a fairly unqualified “yes”.
The principles behind this approach are:
- At frequent intervals, look at what user-related threads are active within my instance using onstat -u.
- Record the detail about the thread state, e.g. any waits and whether the thread is reading.
- Look at what SQL is being run with onstat -g sql.
- After a significant period has elapsed (10 minutes), summarise this information and record it in a separate database.
- Use the SET EXPLAIN ON AVOID_EXECUTE directive and parse all the SQL statements collected, recording the costs, number of rows returned and query plan. Write these to the separate database also.
- Repeat the exercise ad infinitum.
Just three commands are being used: onstat -u, onstat -g sql and SET EXPLAIN ON AVOID_EXECUTE. The impact of this monitoring is low.
To find the active threads I’m interested in I use:
onstat -u | grep -vE ' Y[-*]|[R-][BCDFM] |^IBM Informix|^Userthreads|^address|maximum concurrent'
I also found I needed to occasionally run onstat -g ath to identify system threads like RSS replication send and receive queues and exclude these as well.
The storage requirement for all this is also fairly low: the magic of a relational database means we only need to record the SQL and corresponding query plans once. Most applications will only have a few thousand statements at most, although if your developers don’t use bind variables much, this can multiply up.
All that is really left is to report on it. It’s now fairly trivial to set up the following reports:
- “Top” SQL within a snapshot or series of adjacent 10-minute windows by frequency or based on certain wait events.
- Changed query plans complete with the explain output. Separately, I also log data distributions with dbschema -hd before and after gathering stats which can help explain why these changes have occurred.
I am now in a position so that when a spike in I/O activity is picked up by Sentinel, I can quickly see what has caused it.
You might be thinking that it all sounds great but are concerned that there is a catch. Are there any limitations? Well, a few:
- Capturing the SQL depends on running onstat -g sql immediately after running onstat -u with the smallest delay possible.
- It’s possible the thread will move on. Careful coding and using a multi-threaded program to capture multiple SQL statements at once can reduce any delay. There really isn’t much we can do about this and it’s more of a problem for fast, low cost statements.
- Sampling large number of sessions at the same time with onstat -g sql might lock internal Informix structures.
- It might not as well but I’m a cautious DBA so I use a random throttling mechanism to keep the number of sessions sampled low even if a lot of active sessions are picked up by onstat -u.
- You can’t gather the query plans for sessions using temporary tables?
- It’s true, you can’t!
- The user generating the query plans has to have the necessary privileges to run the statement being parsed.
- This is a problem and to keep things “safe”, I use a read-only user and only generate query plans for select statements. Additionally my user can execute some functions that don’t make any changes to data.
- onstat -g sql mangles the SQL statements somewhat by removing new-lines.
- This is a big problem. If the code contains comments or hints starting with ‘–‘, e.g.
select * from atable -- this is a comment, it’s essential that there is a new-line after the comment or hint. My solution only works because the application developers have been disciplined in indenting code and I put a new-line in where there are several spaces in the SQL output. This means my code is not guaranteed to work in any database, although it does carry on after encountering SQL statements that don’t parse. There are also cases where it’s impossible to know for sure where the new-line should go.
- The plan being captured is not necessarily the one used.
- Ignoring marginal cases like where the statistics have been updated in between session capture and generating the explain plan, a far larger problem is prepared statements within an application. These may be prepared and executed using the original plan for days, even if update statistics has been run since the plan was generated. It’s only when the application is restarted that the SQL is re-parsed and there’s a possibility of a new plan.
My view is that it would be preferable if such monitoring was built into Informix by IBM and was more closely integrated with the engine than is possible using onstat to address the issues above.
I no longer need to say to my boss that we’ll have to wait for another occurrence to tell him what’s happening. My script is constantly gathering and storing data.
If you think this is as great as I do, then you may be wondering where the source code is. I don’t plan to release it, mainly because it would not be guaranteed to work with all systems owing to the way onstat -g sql strips out new-lines in its output.
I welcome any feedback on this. Please send this through my WordPress blog at informixdba.wordpress.com if you are reading this on an aggregator site like planetids.com.