1 – Which data is used by the DB2 Optimizer and which is updated by RUNSTATS?
2 – Which default column values trigger the DB2 Optimizer to use its own internal default values?
Every now and again, I hold a little presentation called Are you a RUNSTATS Master? where I describe in detail, what the DB2 Optimizer uses for access path selection in relation to the DB2 catalog data.
I am always surprised at how often people say “just that data?” or “Is that it?” (the various other reasons for access path selection like CP speed, Number of CPs, RID Pool size, Sort Pool size, Max data caching size, and, of course, the 80 bufferpools are also mentioned, but these have nothing to do with RUNSTATS).
So generally the answer is “Yes”, however the permutations and combinations make the devil in the detail – The DB2 Optimizer’s algorithms are top secret, but the input data it uses is fully described in the documentation.
What I want to do this month is show the Catalog data that is used, the default values that can cause surprising things to happen and the problem of correlations in the catalog.
First is – Which data is used by the DB2 Optimizer and which is updated by RUNSTATS?
Here is a complete list of the eleven tables used by the DB2 Optimizer:
- SYSIBM.SYSCOLDIST
- SYSIBM.SYSCOLSTATS *
- SYSIBM.SYSCOLUMNS
- SYSIBM.SYSINDEXES
- SYSIBM.SYSINDEXPART
- SYSIBM.SYSKEYTARGETS 9 and up (same as SYSCOLUMNS)
- SYSIBM.SYSKEYTGTDIST 9 and up (same as SYSCOLDIST)
- SYSIBM.SYSROUTINES
- SYSIBM.SYSTABLES
- SYSIBM.SYSTABLESPACE
- SYSIBM.SYSTABSTATS
* degree of parallelism only and, after APAR PK62804, also „sometimes“ used to bound filter factor estimates…
Now we can also list out all of the columns (obviously not including the key columns) which are used by the DB2 Optimizer:
SYSCOLDIST
CARDF, COLGROUPCOLNO, COLVALUE, FREQUENCYF, HIGHVALUE, LOWVALUE, NUMCOLUMNS, QUANTILENO, STATSTIME
SYSCOLSTATS
COLCARD, HIGHKEY, LOWKEY
SYSCOLUMNS
COLCARDF, HIGH2KEY, LOW2KEY
SYSINDEXES
CLUSTERING*, CLUSTERRATIO, CLUSTERRATIOF, DATAREPEATFACTORF, FIRSTKEYCARDF, FULLKEYCARDF, NLEAF, NLEVELS
SYSINDEXPART
LIMITKEY*
SYSKEYTARGETS
CARDF, HIGH2KEY, LOW2KEY, STATS_FORMAT
SYSKEYTGTDIST
CARDF, KEYGROUPKEYNO, KEYVALUE, FREQUENCYF, HIGHVALUE, LOWVALUE, NUMKEYS, QUANTILENO, STATSTIME
SYSROUTINES
CARDINALITY*, INITIAL_INSTS*, INITIAL_IOS*, INSTS_PER_INVOC*, IOS_PER_INVOC*
SYSTABLES
CARDF, EDPROC*, NPAGES, NPAGESF, PCTROWCOMP
SYSTABLESPACE
NACTIVE, NACTIVEF
SYSTABSTATS
CARD, CARDF, NPAGES
Notes: * Columns are not updated by RUNSTATS and _ Columns are not updatable at all. The column STATSTIME is used only if there are duplicates in the SYSCOLDIST table, and then the DB2 Optimizer will use the “newer” data that was probably inserted by a User.
Second is – Which default column values trigger the DB2 Optimizer to use its own internal default values?
SYSCOLUMNS | |
If COLCARDF | = -1 then use 25 |
SYSINDEXES | |
If CLUSTERRATIOF | <= 0 then use CLUSTERRATIO |
If CLUSTERRATIO | <= 0 then use 0.95 if the index is CLUSTERing = ‘Y’ otherwise 0.00 |
If FIRSTKEYCARDF | = -1 then use 25 |
If FULLKEYCARDF | = -1 then use 25 |
If NLEAF | = -1 then use 33 (Which is SYSTABLES.CARDF / 300) |
If NLEVELS | = -1 then use 2 |
SYSROUTINES | |
If CARDINALITY | = -1 then use 10,000 |
If INITIAL_INSTS | = -1 then use 40,000 |
If INITIAL_IOS | = -1 then use 0 |
If INSTS_PER_INVOC | = -1 then use 4,000 |
If IOS_PER_INVOC | = -1 then use 0 |
SYSTABLES | |
If CARDF | = -1 then use 10,000 |
If NPAGESF | <= 0 then use NPAGES |
If NPAGES | = -1 then use 501 (Which is CEILING (1 + SYSTABLES.CARDF / 20)) |
SYSTABLESPACE | |
If NACTIVEF | <= 0 then use NACTIVE |
If NACTIVE | <= 0 then use 501 (Which is CEILING (1 + SYSTABLES.CARDF / 20)) |
SYSTABSTATS | |
If CARDF | = -1 then use SYSTABSTATS.CARD |
If CARD | = -1 then use 10,000 |
If NPAGES | = -1 then use 501 (Which is CEILING (1 + SYSTABSTATS.CARDF / 20)) |
So now you can see that non-floating point “old” data, may still be used today and then causes access path headaches!
Now to top it all, the data in the SYSCOLDIST and SYSKEYTGTDIST never gets simply “deleted”. Once that data is inserted it stays there, until it is overwritten by new data or the object is dropped. This all leads to some very old data in these two tables that can and does cause the DB2 Optimizer a ton of grief! One of the first things I do is select the MIN(STATSTIME) from these tables just to see how old the data really is. Do it yourself and be surprised! I have seen sites with eight years old data in the SYSCOLDIST and that cannot be good!
Finally now onto correlations… There are lots of little tricks that DBAs use to massage access path choice and one of these is to just set NLEVELS to 15 for a given index. Then lots of queries simply refuse to touch it as it would appear to be HUGE. Now just simply updating columns can cause the DB2 Optimizer, in the best case to ignore your updates or perhaps makes things even worse! So here is a list of the correlations (In other words, if you change xxx remember to change yyy and zzz as well):
- Relationships exist among certain columns of certain tables:
- Columns within SYSCOLUMNS
- Columns in the tables SYSCOLUMNS and SYSINDEXES
- Columns in the tables SYSCOLUMNS and SYSCOLDIST
- Columns in the tables SYSCOLUMNS, SYSCOLDIST, and SYSINDEXES
- If you plan to update some values, keep in mind the following correlations:
- COLCARDF and FIRSTKEYCARDF/FULLKEYCARDF (They must be equal for the 1st column and full, if a single column index)
- COLCARDF, LOW2KEY and HIGH2KEY. (For non-default COLCARDF LOW2KEY and HIGH2KEY key must be filled with data) and if the COLCARDF is 1 or 2 DB2 uses LOW2KEY and HIGH2KEY as domain statistics to generate frequencies.
- CARDF in SYSCOLDIST. CARDF is related to COLCARDF and FIRSTKEYCARDF and FULLKEYCARDF. It must be at a minimum
- A value between FIRSTKEYCARDF and FULLKEYCARDF if the index contains the same set of columns
- A value between MAX(colcardf of each col) and the product of all the columns COLCARDFs in the group
- CARDF in SYSTABLES. CARDF must be equal or larger than any other cardinalities, such as COLCARDF, FIRSTKEYCARDF, FULLKEYCARDF, and CARDF in SYSCOLDIST
- FREQUENCYF and COLCARDF or CARDF. The number of frequencies collected must be less than or equal to COLCARDF for the column or CARDF for the column group
- FREQUENCYF. The sum of frequencies collected for a column or column group must be less than or equal to 1
Do not forget that our little Freeware tool StatisticsHealthCheck will find all bad correlations, old data and badly updated data for you and it is FREE!
So I hope this little round-up of Catalog Statistics data was interesting, and, as usual, if you have any comments or questions, then please, feel free to mail me!