DDHASH_STATS
============

For figuring out good values for DD_HASHSIZE and DD_HASHMAX.

Prints out statistics for 'onstat -g dic' as if DD_HASHSIZE was
set to a particular value.

This is based, more or less, on Andy Lennard's awk script in the
FAQ (6.11 - How do I configure DD_HASHMAX and DD_HASHSIZE?)

(C) Richard Harnden, 2007

Please report bugs: <richard.harnden@lineone.net>

Public Domain.  Use/copy at will.  No warranties, etc.

Building
========

$ gunzip ddhash_stats_1.0.tar.gz
$ tar xvf ddhash_stats_1.0.tar
$ cd ddhash_stats_1.0
$ make

(The Makefile assumes gcc)

The hash function
=================

The algorithm for hashing the table names has changed with
different versions of IDS.

*** If you use the wrong one, then all the stats will be wrong. ***

onstat -g dic | \
    ./ddhash_stats --detail --dd_hashsize <n> --print-table

should print the tables in the same buckets as onstat -g dic.

If it doesn't, then try adding
    --hash-old or --hash-very-old

I don't know when the hash function was changed, but:
    7.22.UC3 and 7.30.UC7 need --hash-very-old
    9.30.UC2 needs --hash-old
    7.31.UD10 and 9.40.UC8 need --hash-new; this is the default.
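
Whichever variant applies, the overall scheme is the same: each
table name is hashed to a key, and the key is then reduced modulo
DD_HASHSIZE to pick a bucket (see footnote [1]).  A toy sketch of
the idea - the additive hash below is purely illustrative and is
not any of the real IDS hash functions:

    #include <stdio.h>

    /* Illustration only: a toy additive hash.  The real IDS hash
     * functions (the --hash-new / --hash-old / --hash-very-old
     * variants) differ, which is why their bucket assignments
     * can disagree. */
    static unsigned toy_hash(const char *name)
    {
        unsigned key = 0;

        while (*name)
            key += (unsigned char)*name++;
        return key;
    }

    int main(void)
    {
        unsigned dd_hashsize = 31;
        const char *table = "alice@testme:testme.rubbing";
        unsigned key = toy_hash(table);

        printf("key %u -> bucket %u\n", key, key % dd_hashsize);
        return 0;
    }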

Usage
=====

There are 3 modes ...

./ddhash_stats {--summary | -s} <args>
./ddhash_stats {--detail | -d} <args>
./ddhash_stats {--print-duplicate-keys | -k} 

Summary
=======

Prints a table, one candidate DD_HASHSIZE per line.

./ddhash_stats \
    [--dd_hashmax <n>] \
    [{ --prime | --nearly-prime | --composite }] \
    [--min <n>] [--max <n>]

Where:
    --dd_hashmax is the DD_HASHMAX you'd like.
        Keep this small; the default is 10.

    --prime - use only prime numbers for DD_HASHSIZE.
        This is the default.

    --nearly-prime - use composite numbers provided that they
        don't have any prime factors <= 13. [1]

    --composite - use all numbers, prime and composite.

    --min <n> - start with a DD_HASHSIZE of <n>.
        The default is <number of tables> / <dd_hashmax>.

    --max <n> - finish with a DD_HASHSIZE of <n>.
        The default is 4 * <number of tables> / <dd_hashmax>.
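
For example: with the 496 tables in alice.txt and the default
DD_HASHMAX of 10, the scan defaults to DD_HASHSIZE values from
roughly 49 (496 / 10) up to roughly 198 (4 * 496 / 10); the
--max 150 in the example below stops it early.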

eg:

$ cat alice.txt | ./ddhash_stats --max 150
Size Used%  Min  Mid  Max Mean Stdev Skew Tavg 3avg Mode EstKB Over%
---- ----- ---- ---- ---- ---- ----- ---- ---- ---- ---- ----- -----
  53 100.0    4 10.0   17  9.4   3.0 +0.5  9.3 10.2  8.0   871  44.2
  59 100.0    2  9.0   16  8.4   3.1 +0.3  8.2  9.2  8.5   884  35.5
  61 100.0    1  9.0   15  8.1   3.3 +0.0  8.0  9.2  8.0   886  31.0
  67 100.0    3  8.0   15  7.4   2.7 +0.6  7.5  8.0  7.0   901  17.7
  71 100.0    2  8.0   17  7.0   2.9 +0.7  7.0  7.8  7.2   900  15.9
  73 100.0    2  7.0   13  6.8   2.5 +0.5  6.8  7.5  5.0   906  14.5
  79 100.0    1  7.0   12  6.3   2.6 +0.2  6.2  7.0  5.0   909  13.9
  83 100.0    1  7.0   11  6.0   2.4 -0.2  6.0  7.0  7.0   917   2.2
  89  98.9    1  6.0   12  5.6   2.3 +0.3  5.6  6.2  5.0   915   4.6
  97  99.0    1  6.0   10  5.2   2.2 +0.3  5.2  6.0  4.0   918   0.0
 101  99.0    1  6.0   11  5.0   2.5 +0.7  4.8  6.0  3.0   914  11.1
 103  99.0    1  6.0   13  4.9   2.3 +0.4  5.1  6.0  5.0   916   2.6
 107  99.1    1  6.0   11  4.7   2.2 +0.3  4.8  5.8  4.5   918   2.2
 109 100.0    1  5.0   13  4.6   2.2 +1.0  4.7  5.0  4.3   914   5.0
 113 100.0    1  5.0   10  4.4   2.1 +0.5  4.5  5.2  4.0   919   0.0
 127  98.4    1  5.0   10  4.0   1.9 +0.5  4.2  5.0  4.0   919   0.0
 131  99.2    1  5.0    9  3.8   2.0 +0.6  4.1  4.8  3.0   920   0.0
 137  98.5    1  5.0    8  3.7   1.7 +0.3  3.9  4.8  3.0   920   0.0
 139  97.8    1  4.0   10  3.6   1.9 +0.7  4.0  4.2  4.0   920   0.0
 149  97.3    1  4.0   11  3.4   1.8 +1.0  3.7  4.0  2.0   919   2.2

Where:
    Size is DD_HASHSIZE.
    Used% is the percentage of buckets that are actually used.
    Min is the length of the shortest list.
    Mid is the length of the median list.
    Max is the length of the longest list.
    Mean is the arithmetic mean of the used lists.
    Stdev is the standard deviation.
    Skew is the skew - +ve is skewed toward shorter lists [2].
    Tavg is the truncated mean.
    3avg is the tri-mean.
    Mode is the mode (or the mean of the modes if there are >1).
    EstKB is a guess [3][4].
    Over% is the percentage of lists longer than DD_HASHMAX.
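
Most of these columns are computed from the lengths of the used
lists.  A rough C sketch of the simpler ones (Used%, Mean, Stdev,
Over%); the tool's truncated mean, tri-mean, Mode and EstKB
calculations aren't reproduced here, and its exact definitions
(e.g. whether Over% counts used lists or all buckets) may differ:

    #include <math.h>
    #include <stdio.h>

    /* lengths[i] = number of tables hashed into bucket i */
    static void bucket_stats(const int *lengths, int dd_hashsize,
                             int dd_hashmax)
    {
        int used = 0, over = 0;
        double sum = 0.0, sumsq = 0.0;

        for (int i = 0; i < dd_hashsize; i++) {
            if (lengths[i] == 0)
                continue;                   /* empty bucket */
            used++;
            sum += lengths[i];
            sumsq += (double)lengths[i] * lengths[i];
            if (lengths[i] > dd_hashmax)
                over++;                     /* list longer than DD_HASHMAX */
        }
        if (used == 0)
            return;

        double mean = sum / used;
        double stdev = sqrt(sumsq / used - mean * mean);

        printf("Used%% %.1f  Mean %.1f  Stdev %.1f  Over%% %.1f\n",
               100.0 * used / dd_hashsize, mean, stdev,
               100.0 * over / used);
    }

    int main(void)                          /* compile with -lm */
    {
        int lengths[5] = { 3, 0, 7, 12, 2 };    /* made-up bucket lengths */

        bucket_stats(lengths, 5, 10);
        return 0;
    }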

Detail
======

./ddhash_stats --detail \
    --dd_hashsize <n> \
    [--dd_hashmax <n>] \
    [--print-table] \
    [--print-distributions]

Where:
    --dd_hashsize <n> - use <n> for DD_HASHSIZE. Defaults to 31.
    --dd_hashmax is the DD_HASHMAX you'd like. Defaults to 10.
    --print-table - prints the tables in each bucket.
    --print-distributions - prints a histogram.

eg:

$ cat alice.txt | ./ddhash_stats --detail
DD_HASHSIZE              31
DD_HASHMAX               10

Number of tables        496

Estimated Memory (KB)   564..916

Buckets
=======
 Used      % Empty      % Total
----- ------ ----- ------ -----
   31 100.00     0   0.00    31

Lists
=====          <= DD_HASHMAX    > DD_HASHMAX           Total
             --------------- --------------- ---------------
Length Count # Tables      % # Tables      % # Tables      %
------ ----- -------- ------ -------- ------ -------- ------

     7     1        7   1.41        0   0.00        7   1.41
     9     2       18   3.63        0   0.00       18   3.63
------ ----- -------- ------ -------- ------ -------- ------
    <=     3       25   5.04        0   0.00       25   5.04

    11     1       10   2.02        1   0.20       11   2.22
    12     1       10   2.02        2   0.40       12   2.42
    13     4       40   8.06       12   2.42       52  10.48
    14     1       10   2.02        4   0.81       14   2.82
    15     3       30   6.05       15   3.02       45   9.07
    16     4       40   8.06       24   4.84       64  12.90
    17     3       30   6.05       21   4.23       51  10.28
    18     1       10   2.02        8   1.61       18   3.63
    19     3       30   6.05       27   5.44       57  11.49
    20     4       40   8.06       40   8.06       80  16.13
    21     1       10   2.02       11   2.22       21   4.23
    22     1       10   2.02       12   2.42       22   4.44
    24     1       10   2.02       14   2.82       24   4.84
------ ----- -------- ------ -------- ------ -------- ------
     >    28      280  56.45      191  38.51      471  94.96

====== ===== ======== ====== ======== ====== ======== ======
     *    31      305  61.49      191  38.51      496 100.00

Averages
========
 Min   Q1  Mid   Q3  Max  Mean Stdev Skew  Tavg 3avg  Mode
---- ---- ---- ---- ----  ---- ----- ----  ---- ----  ----
   7   15 17.0   20   24  16.0   4.0 -0.3  16.0 17.2  16.3

(with --print-distributions, you'd also get)

Distribution for 496 tables with a DD_HASHSIZE 31

lists     .    1    .    2
      123456789012345678901234 - length
  4:              *  *   *
  3:              * *** **
  2:          *   * *** **
  1:        * * ************ *
          .    1    .    2
      123456789012345678901234 - length

(with --print-table, you'd also get)

DD_HASHSIZE 31

Bucket: 0
    alice@testme:testme.rubbing (8029)
    alice@testme:testme.vinegar (8060)
    alice@testme:testme.welcome (8060)
    alice@testme:testme.panting (8153)
    alice@testme:testme.stretched (11470)
    alice@testme:testme.written (8277)
    alice@testme:testme.pretending (13702)
    alice@testme:testme.encourage (11563)
    alice@testme:testme.muttering (11811)

Bucket: 1
    alice@testme:testme.peeping (8092)
    alice@testme:testme.happens (8247)
    [etc]

... so: 31 is plainly too small a DD_HASHSIZE for these tables.

The same tables, with a DD_HASHSIZE of 113, look much better ...

$ cat alice.txt | ./ddhash_stats --detail --print-distributions --dd_hashsize 113
DD_HASHSIZE             113
DD_HASHMAX               10

Number of tables        496

Estimated Memory (KB)   919..919

Buckets
=======
 Used      % Empty      % Total
----- ------ ----- ------ -----
  113 100.00     0   0.00   113

Lists
=====          <= DD_HASHMAX    > DD_HASHMAX           Total
             --------------- --------------- ---------------
Length Count # Tables      % # Tables      % # Tables      %
------ ----- -------- ------ -------- ------ -------- ------

     1     5        5   1.01        0   0.00        5   1.01
     2    18       36   7.26        0   0.00       36   7.26
     3    21       63  12.70        0   0.00       63  12.70
     4    22       88  17.74        0   0.00       88  17.74
     5    14       70  14.11        0   0.00       70  14.11
     6    13       78  15.73        0   0.00       78  15.73
     7     9       63  12.70        0   0.00       63  12.70
     8     7       56  11.29        0   0.00       56  11.29
     9     3       27   5.44        0   0.00       27   5.44
    10     1       10   2.02        0   0.00       10   2.02
====== ===== ======== ====== ======== ====== ======== ======
     *   113      496 100.00        0   0.00      496 100.00

Averages
========
 Min   Q1  Mid   Q3  Max  Mean Stdev Skew  Tavg 3avg  Mode
---- ---- ---- ---- ----  ---- ----- ----  ---- ----  ----
   1    4  5.0    7   10   4.4   2.1 +0.5   4.5  5.2   4.0

Distribution for 496 tables with a DD_HASHSIZE 113

lists     .    1
      1234567890 - length
 22:     *
 21:    **
 20:    **
 19:    **
 18:   ***
 17:   ***
 16:   ***
 15:   ***
 14:   ****
 13:   *****
 12:   *****
 11:   *****
 10:   *****
  9:   ******
  8:   ******
  7:   *******
  6:   *******
  5:  ********
  4:  ********
  3:  *********
  2:  *********
  1:  **********
          .    1
      1234567890 - length

Duplicate Keys
==============

Prints all the tables which hash to the same key.

These will always end up in the same bucket regardless of DD_HASHSIZE.

Short of renaming your tables, there isn't anything you can do about
this.

Use this to get a minimum value for DD_HASHMAX (although you'll
probably need a very large DD_HASHSIZE to actually hit it).

eg:

$ cat alice.txt | ./ddhash_stats --print-duplicate-keys
Key: 7912 is used 2 times:
    alice@testme:testme.obliged
    alice@testme:testme.dreamed

[... snip ...]

Key: 14195 is used 2 times:
    alice@testme:testme.difficulty
    alice@testme:testme.ridiculous


Number of tables: 496

123 tables have duplicate keys
Total Keys: 427 of which 373 are unique and 54 have duplicates
Worst key(s) are used 4 times
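
(Here the worst keys are shared by 4 tables, so no choice of
DD_HASHSIZE can make the longest list shorter than 4 - that is the
minimum DD_HASHMAX referred to above.)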


Footnotes
=========

[1] - Prime and Nearly-Prime

http://groups.google.co.uk/group/comp.databases.informix/browse_frm/thread/c8cbcf34665e8074

Madison Pruet wrote in Message-ID: <39AE60A5.7534CFFD@home.com>
> As a general rule, it's best to keep DD_HASHSIZE as a prime (or near-prime)
> number.  Why?? -- Well obviously we are going to have to be doing some
> dividing (mod)  somewhere to convert the hash number into a bucket number.  As
> with any hashing/mod algorithm, it is possible for collisions to develop.  By
> keeping this to a near-prime number, then you can reduce the probability of
> collisions.  By near prime, I mean a number that is not evenly divisible by
> the prime numbers below 14.  (i.e. 2,3,5,7,11,13).
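
In other words, a "nearly prime" DD_HASHSIZE is one that none of
2, 3, 5, 7, 11 or 13 divides.  A small check along those lines
(a sketch only; not necessarily how ddhash_stats tests it):

    #include <stdio.h>

    /* "Nearly prime" in the sense quoted above: not divisible by
     * any prime below 14.  (Primes above 13 pass this test too.) */
    static int nearly_prime(int n)
    {
        static const int small_primes[] = { 2, 3, 5, 7, 11, 13 };

        for (int i = 0; i < 6; i++)
            if (n % small_primes[i] == 0)
                return 0;
        return 1;
    }

    int main(void)
    {
        for (int n = 100; n <= 130; n++)
            if (nearly_prime(n))
                printf("%d\n", n);    /* 101 103 107 109 113 127 */
        return 0;
    }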

[2] - More, but shorter lists are better.

http://groups.google.co.uk/group/comp.databases.informix/browse_frm/thread/97fb8f8393faec7c

Michael Mueller wrote in Message-ID: <b69icl$rld$1@terabinaries.xmission.com>
> [...]
> But for a large multiprocessor there is another important consideration. 
> Every hash list is protected by a mutex called "ddh chain" (all with the 
> same name unfortunately) to make sure that only one thread at a time can 
> either insert or delete an entry or increase or decrease an entry's 
> refcnt. If DD_HASHMAX is too large, the mutex might become a hot spot 
> and onstat -g ath and onstat -g lmx might show threads waiting for that 
> mutex. For that reason it is better to increase DD_HASHSIZE and keep 
> DD_HASHMAX small.

[3] - Memory Requirements

http://groups.google.co.uk/group/comp.databases.informix/browse_frm/thread/3db0786739df892b

Mark Jamison wrote in Message-ID: <mailman.163.1162411115.29126.informix-list@iiug.org>
> [...] there is very little overhead for DD entries over all , let me see if I can dig up my rough sizing calculations. 
> 
> The base overhead for the Data Dictionary Cache is as follows:
> 
> X1 = 24 + (40 * DD_HASHSIZE)
> 
> Below is the minimum cost of every Data Dictionary entry:
> 
> Y1 = 352 + (96 * Number of Columns per table)
> 
> Total size (rough estimate)
> 
> TS = X1 + (Y1 * (DD_HASHSIZE * DD_HASHMAX)) 
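
Taken literally, that rough estimate is easy to compute.  A sketch
(the avg_cols value and the example numbers are assumptions; the
EstKB column above is only a guess loosely based on these
postings):

    #include <stdio.h>

    /* Rough upper bound from the figures quoted above:
     * X1 = fixed cache overhead, Y1 = per-entry cost, and the
     * cache is assumed full (DD_HASHSIZE * DD_HASHMAX entries). */
    static long dd_cache_bytes(long dd_hashsize, long dd_hashmax,
                               long avg_cols)
    {
        long x1 = 24 + 40 * dd_hashsize;
        long y1 = 352 + 96 * avg_cols;

        return x1 + y1 * dd_hashsize * dd_hashmax;
    }

    int main(void)
    {
        /* e.g. DD_HASHSIZE 113, DD_HASHMAX 10, ~5 columns per table */
        printf("~%ld KB\n", dd_cache_bytes(113, 10, 5) / 1024);  /* ~922 KB */
        return 0;
    }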

[4] - Memory Requirements

http://groups.google.co.uk/group/comp.databases.informix/browse_frm/thread/c8cbcf34665e8074

Madison Pruet wrote in Message-ID: <39AE60A5.7534CFFD@home.com>
> The memory required depends on 1) the number of columns in the table, 2) the
> number of indexes in the table, 3) the number of triggers on the table, 4) the
> number of constraints on the table, 5) the number of fragments of the table,
> ...  However, with modern 64-bit versions, you get so much memory
> addressability, maybe it makes sense to simply plan on all tables being loaded
> into memory.  However, on systems running on smaller memory platforms, maybe
> you don't want to do that.  Maybe it makes more sense to accept the cost of
> rebuilding the dictionary cache when it's needed.  Maybe you need your memory
> to be used elsewhere.


@(#) $Id: README,v 1.1 2007/12/07 15:38:03 richard Exp $