DDHASH_STATS
============

For figuring out good values for DD_HASHSIZE and DD_HASHMAX.

Prints out statistics for 'onstat -g dic' as if DD_HASHSIZE was set to a
particular value.

This is based, more or less, on Andy Lennard's awk script in the FAQ
(6.11 - How do I configure DD_HASHMAX and DD_HASHSIZE?)

(C) Richard Harnden, 2007
Please report bugs:
Public Domain.  Use/copy at will.  No warranties, etc.


Building
========

  $ gunzip ddhash_stats_1.0.tar.gz
  $ tar xvf ddhash_stats_1.0.tar
  $ cd ddhash_stats_1.0
  $ make

(The Makefile assumes gcc)


The hash function
=================

The algorithm for hashing the table names has changed with different
versions of IDS.

*** If you use the wrong one, then all the stats will be wrong. ***

  onstat -g dic | \
      ./ddhash_stats --detail --dd_hashsize <your DD_HASHSIZE> --print-table

should print the tables in the same buckets as onstat -g dic.
If it doesn't, then try adding --hash-old or --hash-very-old.

I don't know when the hash function was changed, but:

  7.22.UC3 and 7.30.UC7 need --hash-very-old
  9.30.UC2 needs --hash-old
  7.31.UD10 and 9.40.UC8 need --hash-new; this is the default.


Usage
=====

There are 3 modes ...

  ./ddhash_stats {--summary | -s}
  ./ddhash_stats {--detail | -d}
  ./ddhash_stats {--print-duplicate-keys | -k}


Summary
=======

Prints a table, one DD_HASHSIZE per line.

  ./ddhash_stats \
      [--dd_hashmax <n>] \
      [{ --prime | --nearly_prime | --composite }] \
      [--min <n>] [--max <n>]

Where:

  --dd_hashmax <n> is the DD_HASHMAX you'd like.  Keep this small;
      the default is 10.

  --prime - use only prime numbers for DD_HASHSIZE.  This is the default.

  --nearly-prime - use composite numbers provided that they don't have
      any factors <= 13. [1]

  --composite - use all the numbers, prime and composite.

  --min <n> - start with a DD_HASHSIZE of <n>.
      The default is <number of tables> / <dd_hashmax>.

  --max <n> - finish with a DD_HASHSIZE of <n>.
      The default is 4 * <number of tables> / <dd_hashmax>.

eg:

  $ cat alice.txt | ./ddhash_stats --max 150

  Size Used%  Min  Mid  Max Mean Stdev Skew Tavg 3avg Mode EstKB Over%
  ---- ----- ---- ---- ---- ---- ----- ---- ---- ---- ---- ----- -----
    53 100.0    4 10.0   17  9.4   3.0 +0.5  9.3 10.2  8.0   871  44.2
    59 100.0    2  9.0   16  8.4   3.1 +0.3  8.2  9.2  8.5   884  35.5
    61 100.0    1  9.0   15  8.1   3.3 +0.0  8.0  9.2  8.0   886  31.0
    67 100.0    3  8.0   15  7.4   2.7 +0.6  7.5  8.0  7.0   901  17.7
    71 100.0    2  8.0   17  7.0   2.9 +0.7  7.0  7.8  7.2   900  15.9
    73 100.0    2  7.0   13  6.8   2.5 +0.5  6.8  7.5  5.0   906  14.5
    79 100.0    1  7.0   12  6.3   2.6 +0.2  6.2  7.0  5.0   909  13.9
    83 100.0    1  7.0   11  6.0   2.4 -0.2  6.0  7.0  7.0   917   2.2
    89  98.9    1  6.0   12  5.6   2.3 +0.3  5.6  6.2  5.0   915   4.6
    97  99.0    1  6.0   10  5.2   2.2 +0.3  5.2  6.0  4.0   918   0.0
   101  99.0    1  6.0   11  5.0   2.5 +0.7  4.8  6.0  3.0   914  11.1
   103  99.0    1  6.0   13  4.9   2.3 +0.4  5.1  6.0  5.0   916   2.6
   107  99.1    1  6.0   11  4.7   2.2 +0.3  4.8  5.8  4.5   918   2.2
   109 100.0    1  5.0   13  4.6   2.2 +1.0  4.7  5.0  4.3   914   5.0
   113 100.0    1  5.0   10  4.4   2.1 +0.5  4.5  5.2  4.0   919   0.0
   127  98.4    1  5.0   10  4.0   1.9 +0.5  4.2  5.0  4.0   919   0.0
   131  99.2    1  5.0    9  3.8   2.0 +0.6  4.1  4.8  3.0   920   0.0
   137  98.5    1  5.0    8  3.7   1.7 +0.3  3.9  4.8  3.0   920   0.0
   139  97.8    1  4.0   10  3.6   1.9 +0.7  4.0  4.2  4.0   920   0.0
   149  97.3    1  4.0   11  3.4   1.8 +1.0  3.7  4.0  2.0   919   2.2

Where:

  Size  is DD_HASHSIZE.
  Used% is the percentage of buckets that are actually used.
  Min   is the length of the shortest list.
  Mid   is the length of the median list.
  Max   is the length of the longest list.
  Mean  is the arithmetic mean of the used lists.
  Stdev is the standard deviation.
  Skew  is the skew - +ve is skewed toward shorter lists. [2]
  Tavg  is the truncated mean.
  3avg  is the tri-mean.
  Mode  is the mode (or the mean of the modes if there are >1).
  EstKB is a guess. [3][4]
  Over% is the percentage of lists longer than dd_hashmax.
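All of these figures come from modelling the bucket a table lands in as
its hash key modulo DD_HASHSIZE.  The fragment below is not ddhash_stats'
own code - the toy keys are invented, the real IDS hash of the table names
is not reproduced, and whether Over% is taken over used or total buckets is
an assumption here - it only sketches how Used%, Max and Over% fall out of
a list of keys for one candidate DD_HASHSIZE:

  /* Toy illustration: tally list lengths for one candidate DD_HASHSIZE.
   * The keys[] values stand in for the per-table hash keys that
   * onstat -g dic reports.                                              */
  #include <stdio.h>

  #define DD_HASHSIZE 31
  #define DD_HASHMAX  10

  int main(void)
  {
      unsigned keys[] = { 8029, 8060, 8060, 8153, 11470, 8277, 13702 };
      int ntables = sizeof(keys) / sizeof(keys[0]);
      int len[DD_HASHSIZE] = { 0 };
      int used = 0, longest = 0, over = 0;

      for (int i = 0; i < ntables; i++)
          len[keys[i] % DD_HASHSIZE]++;       /* bucket = key mod DD_HASHSIZE */

      for (int b = 0; b < DD_HASHSIZE; b++) {
          if (len[b] > 0)          used++;
          if (len[b] > longest)    longest = len[b];
          if (len[b] > DD_HASHMAX) over++;    /* Over%: lists longer than DD_HASHMAX */
      }

      printf("Used%% %.1f  Max %d  Over%% %.1f\n",
             100.0 * used / DD_HASHSIZE, longest,
             used ? 100.0 * over / used : 0.0);
      return 0;
  }

The same tally, repeated for each candidate size, is what the summary
table above condenses into one line per DD_HASHSIZE.
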
Detail
======

  ./ddhash_stats --detail \
      --dd_hashsize <n> \
      [--dd_hashmax <n>] \
      [--print-table] \
      [--print-distributions]

Where:

  --dd_hashsize <n> - use <n> for DD_HASHSIZE.  Defaults to 31.

  --dd_hashmax <n> is the DD_HASHMAX you'd like.  Defaults to 10.

  --print-table - prints the tables in each bucket.

  --print-distributions - prints a histogram.

eg:

  $ cat alice.txt | ./ddhash_stats --detail

  DD_HASHSIZE            31
  DD_HASHMAX             10
  Number of tables      496
  Estimated Memory (KB) 564..916

  Buckets
  =======

   Used      % Empty      % Total
  ----- ------ ----- ------ -----
     31 100.00     0   0.00    31

  Lists
  =====

                 <= DD_HASHMAX      > DD_HASHMAX         Total
               --------------- --------------- ---------------
  Length Count # Tables      % # Tables      % # Tables      %
  ------ ----- -------- ------ -------- ------ -------- ------
       7     1        7   1.41        0   0.00        7   1.41
       9     2       18   3.63        0   0.00       18   3.63
  ------ ----- -------- ------ -------- ------ -------- ------
      <=     3       25   5.04        0   0.00       25   5.04
      11     1       10   2.02        1   0.20       11   2.22
      12     1       10   2.02        2   0.40       12   2.42
      13     4       40   8.06       12   2.42       52  10.48
      14     1       10   2.02        4   0.81       14   2.82
      15     3       30   6.05       15   3.02       45   9.07
      16     4       40   8.06       24   4.84       64  12.90
      17     3       30   6.05       21   4.23       51  10.28
      18     1       10   2.02        8   1.61       18   3.63
      19     3       30   6.05       27   5.44       57  11.49
      20     4       40   8.06       40   8.06       80  16.13
      21     1       10   2.02       11   2.22       21   4.23
      22     1       10   2.02       12   2.42       22   4.44
      24     1       10   2.02       14   2.82       24   4.84
  ------ ----- -------- ------ -------- ------ -------- ------
       >    28      280  56.45      191  38.51      471  94.96
  ====== ===== ======== ====== ======== ====== ======== ======
       *    31      305  61.49      191  38.51      496 100.00

  Averages
  ========

   Min   Q1  Mid   Q3  Max Mean Stdev Skew Tavg 3avg Mode
  ---- ---- ---- ---- ---- ---- ----- ---- ---- ---- ----
     7   15 17.0   20   24 16.0   4.0 -0.3 16.0 17.2 16.3

(with --print-distributions, you'd also get)

  Distribution for 496 tables with a DD_HASHSIZE 31

  lists
            .    1    .    2
        123456789012345678901234 - length
     4:             *  *   *
     3:             * *** **
     2:         *   * *** **
     1:       * * ************ *
            .    1    .    2
        123456789012345678901234 - length

(with --print-table, you'd also get)

  DD_HASHSIZE 31

  Bucket: 0
          alice@testme:testme.rubbing (8029)
          alice@testme:testme.vinegar (8060)
          alice@testme:testme.welcome (8060)
          alice@testme:testme.panting (8153)
          alice@testme:testme.stretched (11470)
          alice@testme:testme.written (8277)
          alice@testme:testme.pretending (13702)
          alice@testme:testme.encourage (11563)
          alice@testme:testme.muttering (11811)
  Bucket: 1
          alice@testme:testme.peeping (8092)
          alice@testme:testme.happens (8247)

  [etc]

... so: 31 is plainly too small a DD_HASHSIZE for these tables.
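(An aside on the "Estimated Memory (KB)" line: it is only a guess, as
footnotes [3] and [4] explain.  The sketch below just runs Mark Jamison's
rough sizing formula from footnote [3]; the 16-column average is invented
for the example, and ddhash_stats' own range is not necessarily computed
this way.)

  /* Rough sizing arithmetic from footnote [3] - illustration only.
   * avg_columns is an assumed figure, not taken from any real schema.   */
  #include <stdio.h>

  int main(void)
  {
      long dd_hashsize = 31, dd_hashmax = 10;
      long ntables = 496, avg_columns = 16;

      long x1 = 24 + (40 * dd_hashsize);      /* base cache overhead     */
      long y1 = 352 + (96 * avg_columns);     /* minimum cost per entry  */

      /* Footnote [3]'s figure assumes a fully populated cache
         (DD_HASHSIZE * DD_HASHMAX entries) ...                          */
      long full = x1 + (y1 * (dd_hashsize * dd_hashmax));

      /* ... whereas a cache big enough for every table would need
         one entry per table.                                            */
      long all_tables = x1 + (y1 * ntables);

      printf("full cache : ~%ld KB\n", full / 1024);
      printf("all tables : ~%ld KB\n", all_tables / 1024);
      return 0;
  }

With those assumptions the two figures land in the same ballpark as the
564..916 range reported above.
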
The same tables, with a DD_HASHSIZE of 113, look much better ...

  $ cat alice.txt | ./ddhash_stats --detail --print-distributions --dd_hashsize 113

  DD_HASHSIZE           113
  DD_HASHMAX             10
  Number of tables      496
  Estimated Memory (KB) 919..919

  Buckets
  =======

   Used      % Empty      % Total
  ----- ------ ----- ------ -----
    113 100.00     0   0.00   113

  Lists
  =====

                 <= DD_HASHMAX      > DD_HASHMAX         Total
               --------------- --------------- ---------------
  Length Count # Tables      % # Tables      % # Tables      %
  ------ ----- -------- ------ -------- ------ -------- ------
       1     5        5   1.01        0   0.00        5   1.01
       2    18       36   7.26        0   0.00       36   7.26
       3    21       63  12.70        0   0.00       63  12.70
       4    22       88  17.74        0   0.00       88  17.74
       5    14       70  14.11        0   0.00       70  14.11
       6    13       78  15.73        0   0.00       78  15.73
       7     9       63  12.70        0   0.00       63  12.70
       8     7       56  11.29        0   0.00       56  11.29
       9     3       27   5.44        0   0.00       27   5.44
      10     1       10   2.02        0   0.00       10   2.02
  ====== ===== ======== ====== ======== ====== ======== ======
       *   113      496 100.00        0   0.00      496 100.00

  Averages
  ========

   Min   Q1  Mid   Q3  Max Mean Stdev Skew Tavg 3avg Mode
  ---- ---- ---- ---- ---- ---- ----- ---- ---- ---- ----
     1    4  5.0    7   10  4.4   2.1 +0.5  4.5  5.2  4.0

  Distribution for 496 tables with a DD_HASHSIZE 113

  lists
            .    1
        1234567890 - length
    22:    *
    21:   **
    20:   **
    19:   **
    18:  ***
    17:  ***
    16:  ***
    15:  ***
    14:  ****
    13:  *****
    12:  *****
    11:  *****
    10:  *****
     9:  ******
     8:  ******
     7:  *******
     6:  *******
     5: ********
     4: ********
     3: *********
     2: *********
     1: **********
            .    1
        1234567890 - length


Duplicate Keys
==============

Prints all the tables which hash to the same key.  These will always end
up in the same bucket, regardless of DD_HASHSIZE.

Short of renaming your tables, there isn't anything you can do about this.

Use this to get a minimum value for DD_HASHMAX (although you'll probably
need a very large DD_HASHSIZE to actually hit it).

eg:

  $ cat alice.txt | ./ddhash_stats --print-duplicate-keys

  Key: 7912 is used 2 times:
          alice@testme:testme.obliged
          alice@testme:testme.dreamed

  [... snip ...]

  Key: 14195 is used 2 times:
          alice@testme:testme.difficulty
          alice@testme:testme.ridiculous

  Number of tables: 496
  123 tables have duplicate keys
  Total Keys: 427 of which 373 are unique and 54 have duplicates
  Worst key(s) are used 4 times
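(The "Worst key(s) are used N times" figure is the one to watch: those
tables share a bucket whatever DD_HASHSIZE you pick, so N is a floor for
any DD_HASHMAX you hope to stay under.  A sketch of that count, with
invented keys - not ddhash_stats' own code:)

  /* Illustration only: the most-repeated hash key sets a floor on any
   * useful DD_HASHMAX, since those tables always share a bucket.        */
  #include <stdio.h>
  #include <stdlib.h>

  static int cmp(const void *a, const void *b)
  {
      unsigned x = *(const unsigned *)a, y = *(const unsigned *)b;
      return (x > y) - (x < y);
  }

  int main(void)
  {
      unsigned keys[] = { 7912, 14195, 7912, 8060, 14195, 7912 };
      int n = sizeof(keys) / sizeof(keys[0]);
      int run = 1, worst = 1;

      qsort(keys, n, sizeof(keys[0]), cmp);   /* equal keys become adjacent */
      for (int i = 1; i < n; i++) {
          run = (keys[i] == keys[i - 1]) ? run + 1 : 1;
          if (run > worst)
              worst = run;
      }
      printf("Worst key(s) are used %d times\n", worst);
      return 0;
  }
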
Footnotes
=========

[1] - Prime and Nearly-Prime

    http://groups.google.co.uk/group/comp.databases.informix/browse_frm/thread/c8cbcf34665e8074

    Madison Pruet wrote in Message-ID: <39AE60A5.7534CFFD@home.com>

    > As a general rule, it's best to keep DD_HASHSIZE as a prime (or near-prime)
    > number.  Why??  -- Well obviously we are going to have to be doing some
    > dividing (mod) somewhere to convert the hash number into a bucket number.  As
    > with any hashing/mod algorithm, it is possible for collisions to develop.  By
    > keeping this to a near-prime number, then you can reduce the probability of
    > collisions.  By near prime, I mean a number that is not evenly divisible by
    > the prime numbers below 14.  (i.e. 2,3,5,7,11,13).

[2] - More, but shorter, lists are better.

    http://groups.google.co.uk/group/comp.databases.informix/browse_frm/thread/97fb8f8393faec7c

    Michael Mueller wrote in Message-ID:

    > [...]
    > But for a large multiprocessor there is another important consideration.
    > Every hash list is protected by a mutex called "ddh chain" (all with the
    > same name unfortunately) to make sure that only one thread at a time can
    > either insert or delete an entry or increase or decrease an entry's
    > refcnt.  If DD_HASHMAX is too large, the mutex might become a hot spot
    > and onstat -g ath and onstat -g lmx might show threads waiting for that
    > mutex.  For that reason it is better to increase DD_HASHSIZE and keep
    > DD_HASHMAX small.

[3] - Memory Requirements

    http://groups.google.co.uk/group/comp.databases.informix/browse_frm/thread/3db0786739df892b

    Mark Jamison wrote in Message-ID:

    > [...] there is very little overhead for DD entries over all, let me see
    > if I can dig up my rough sizing calculations.
    >
    > The base overhead for the Data Dictionary Cache is as follows:
    >
    >     X1 = 24 + (40 * DD_HASHSIZE)
    >
    > Below is the minimum cost of every Data Dictionary entry:
    >
    >     Y1 = 352 + (96 * Number of Columns per table)
    >
    > Total size (rough estimate)
    >
    >     TS = X1 + (Y1 * (DD_HASHSIZE * DD_HASHMAX))

[4] - Memory Requirements

    http://groups.google.co.uk/group/comp.databases.informix/browse_frm/thread/c8cbcf34665e8074

    Madison Pruet wrote in Message-ID: <39AE60A5.7534CFFD@home.com>

    > The memory required depends on 1) the number of columns in the table, 2) the
    > number of indexes in the table, 3) the number of triggers on the table, 4) the
    > number of constraints on the table, 5) the number of fragments of the table,
    > ...  However, with modern 64-bit versions, you get so much memory
    > addressability, maybe it makes sense to simply plan on all tables being loaded
    > into memory.  However, on systems running on smaller memory platforms, maybe
    > you don't want to do that.  Maybe it makes more sense to accept the cost of
    > rebuilding the dictionary cache when it's needed.  Maybe you need your memory
    > to be used elsewhere.

@(#) $Id: README,v 1.1 2007/12/07 15:38:03 richard Exp $