#!/bin/sh # # This is a shell archive. To extract its contents, # execute this file with /bin/sh to create the file(s): # # metaphone.mk1 metaphone.mk2 metaphone.rm metaphone.sf # # This shell archive created: Tue Aug 27 10:09:27 EDT 1996 # echo "Extracting file metaphone.mk1" sed -e 's/^X//' <<\SHAR_EOF > metaphone.mk1 XFrom: Michael Kuhn XSubject: Metaphone searches XDate: Fri, 24 Nov 1995 11:26:50 -0500 (EST) X XI have tested this "Metaphone" routine to the best of my current Xtime/ability. You can put it in the archive. I talked Sadru Fidai Xabout his routine that is in the archive. I have included the notes Xfrom that conversation in the comments. X XMy opinion is that this algorithm does group names together that are Xmore closely related than Soundex does. However, in my paticular Xsituation it cause the # of matches to increase significantly. I am Xplanning a trial period in the next couple of weeks to see if it Xis managable. X XRobert Minter sent me a 4GL program that is a "hacked" version of Xthis algorithm, ie. it does not included all of the transformation Xrules originally described. This was not much help for what I was Xtrying to do. X XIf anything I got the important stuff from the original article Xwritten down. X XI am sure somebody can have as much fun with this as I did. :-)) X X-- XMichael J. Kuhn Computer Systems Consultant phone:410-254-7060 XEmail: mkuhn@csd.clark.net X mkuhn@rhlab.com or mkuhn%rhlab@uunet.uu.net or uunet!rhlab!mkuhn X c/o Baltimore Rh Typing Laboratory, Inc. phone:410-225-9595 X X****************************** X X/* Metaphone Conversion Notes X X When I found this Algorithm, in article, there were discrepancies between X the BASIC code and the verbal description. The discrepances look like they X could have been caused by typing errors in the article. X X I have included the BASIC code from this article for the specific purpose X of presenting the Algorithm the way it was originally described. X I have tried to reproduce the BASIC EXACTLY the way it appeared. So when X you see "ENAM" with nothing behind it, that is how it was presented. X X Lawrence Philips has no doubt spent a lot of time in the development of X this algorithm. I am trusting that the algorithm described has been X throughly tested to the best of his ability. X X It was my intention to reproduce it using his rules as best as I could X discern them. X X It looks like it works better than Soundex. Thank You Lawrence. X X To anyone passing this along. Please include all of the notes they X are part of the documentation and credits. Thanks X X Mike Kuhn (mkuhn@rhlab.com) X X Michael J. Kuhn Computer Systems Consultant X 5916 Glenoak Ave. X Baltimore, MD 21214-2009 X 410-254-7060 X X P.S. X A version of this routine in the Informix Archive was done by: X X Sadru Fidai Munics Information Systems X 50 Mount Prospect Ave X Clifton NJ 07013 (201)778-7753 X aol.com!SFidai X X Sadru called me to discuss this and said the following: X X His routine was NOT done from the article published in "Computer Language". X He started with a working version from a PICK system that was using this. X He had 2,000+ names with metaphone from the PICK system that he used X to test the C code with. X X You might want to check this routine out. X X I did not use his routine at the time because there was no verbal X explanation of the transformations. Also my intent was to be able X to easily modify the transformation rules with some of my own. X X I did a mod 100 of my 20,726 test names and got 221 scattered names. X I then computed Metaphone for Sadru version and mine. There were 14 X differences. Excluding the trailing S's in his, which I eliminated. X I also changed his code so that O was a ZERO. The differences account X for changes I MADE and interpretation of transformation rules. X X At this point I have no need to do a more comprehensive analysis. X X lastname Mike Kuhn Sadru Fidai X X ANASTHA ANS0 ANSX X DAVIS-CARTER TFSKRTR TFXKRTR X ESCARMANT ESKRMNT EXKRMNT X MCCALL MCL MKKL X MCCROREY MCRR MKKRR X MERSEAL MRSL MRXL X PIEURISSAINT PRSNT PRXNT X ROTMAN RTMN RXMN X SCHEVEL SXFL SKFL X SCHROM SXRM SKRM X SEAL SL XL X SPARR SPR XPR X STARLEPER STRLPR XTRLPR X THRASH TRX 0RX X*/ X X X/*************************************************************** X X Metaphone Algorithm X X Created by Lawrence Philips (location unknown). Metaphone presented X in article in "Computer Language" December 1990 issue. X X Converted from Pick BASIC, as demonstrated in article, to C by X Michael J. Kuhn (Baltimore, Maryland) X X My original intention was to replace SOUNDEX with METAPHONE in X order to get lists of similar sounding names that were more precise. X SOUNDEX maps "William" and "Williams" to the same values. METAPHONE X as it turns out DOES THE SAME. There are going to be problems X that you need to resolve with your own set of data. X X Basically, for my problem with S's I think that if X X IF metaphone[strlen(metaphone)] == "S" X AND strlen(metaphone) >= 4 THEN X X metaphone[strlen(metaphone)] = "" X X You can add you own rules as required. X X Also, Lawrence Philips suggests that for practical reasons only the X first 4 characters of the metaphone be used. This happens to be the X number of characters that Soundex produces. This is indeed practical X if you already have reserved exactly 4 characters in your database. X X In addition an analysis of your data may show that names are split X into undesirable "metaphone groups" as the number of metaphone characters X increases. X X *********** BEGIN METAPHONE RULES *********** X X Lawrence Philips' RULES follow: X X The 16 consonant sounds: X |--- ZERO represents "th" X | X B X S K J T F H L M N P R 0 W Y X X Exceptions: X X Beginning of word: "ae-", "gn", "kn-", "pn-", "wr-" ----> drop first letter X "Aebersold", "Gnagy", "Knuth", "Pniewski", "Wright" X X Beginning of word: "x" ----> change to "s" X as in "Deng Xiaopeng" X X Beginning of word: "wh-" ----> change to "w" X as in "Whalen" X X Transformations: X X B ----> B unless at the end of word after "m", as in "dumb", "McComb" X X C ----> X (sh) if "-cia-" or "-ch-" X S if "-ci-", "-ce-", or "-cy-" X SILENT if "-sci-", "-sce-", or "-scy-" X K otherwise, including in "-sch-" X X D ----> J if in "-dge-", "-dgy-", or "-dgi-" X T otherwise X X F ----> F X X G ----> SILENT if in "-gh-" and not at end or before a vowel X in "-gn" or "-gned" X in "-dge-" etc., as in above rule X J if before "i", or "e", or "y" if not double "gg" X K otherwise X X H ----> SILENT if after vowel and no vowel follows X or after "-ch-", "-sh-", "-ph-", "-th-", "-gh-" X H otherwise X X J ----> J X X K ----> SILENT if after "c" X K otherwise X X L ----> L X X M ----> M X X N ----> N X X P ----> F if before "h" X P otherwise X X Q ----> K X X R ----> R X X S ----> X (sh) if before "h" or in "-sio-" or "-sia-" X S otherwise X X T ----> X (sh) if "-tia-" or "-tio-" X 0 (th) if before "h" X silent if in "-tch-" X T otherwise X X V ----> F X X W ----> SILENT if not followed by a vowel X W if followed by a vowel X X X ----> KS X X Y ----> SILENT if not followed by a vowel X Y if followed by a vowel X X Z ----> S X X **************************************************************/ X X/* X X NOTE: This list turned out to be various issues that I passed over X while trying to discern this algorithm. The final outcome X of these items may or may not be reflected in the code. X X There where some discrepancies between the Pick BASIC code in the X original article and the verbal discription of the transformations: X X 1. CASE SYMB = "G" X X AND ENAME[N +3] = "D" AND (N + 3) = L)) and ENAM X ^ X this was cut off in the magazine listing | X X I used the verbal discription in the transformation list X to add the appropriate code. X X 2. H ----> SILENT if after vowel and no vowel follows X H otherwise X X This is the transformation description, however, the BASIC X routine HAS code do this: X X SILENT if after "ch-", "sh-", "ph-", "th", "gh" X X which is the correct behaviour if you look at c,s,p,t,g X X If did not, however, have "after vowel" coded even though this X was in the description. I added it. X X 3. The BASIC code appears to skip double letters except "C" yet X the transformation code for "G" looks at previous letter to X see if we have "GG". This is inconsistent. X X I am making the assumption that "C" was a typo in the BASIC X code. It should have been "G". X X 4. Transformation notation. "-..-" where .. are letters; means that X the letters indicated are bounded by other letters. So "-gned" X means at the end and "ch-" means at the beginning. I have noticed X that the later is not explicity stated in the verbal description X but it is coded in the BASIC. X X 5. case 'C' K otherwise, including in "-sch-" X this implies that "sch" be bounded by other letters. The BASIC X code, however, has: N > 1 X It should have N > 2 for this to be correct. X SCH- X 123 greater than 1 means that C can be 2nd letter X X I coded it as per the verbal description and not what was in X the code. X X 6. as of 11-20-95 I am still trying to understand "H". The BASIC X code seems to indicate that if "-.h" is at the end it is not X silent. But if it is at the end there is no way a vowel could X follow the "h". I am looking for examples. X X 7. ok now I am really confused. Case "T". There is code in BASIC X that says if next = "H" and previous != "T" . There is no X way that a double T goes through the code. Double letters X are dumped in the beginning. X X MATTHEW, MATTHIES, etc X X The first T goes through the second is skipped so the X "th" is never detected. X X Modified routine to allow "G,T" duplicates through the switch. X X 8. case "D" -dge- is indicated in transformation X -dge- or -dge is coded. X X STEMBRIDGE should have "j" on end and not "t" X X I am leaving the code as is, verbal must be wrong. X X 9. Regarding duplicate letters. "C" must be allowed through X as in all of the McC... names. X X The way to handle "GG and "TT" I think is to pass over the X first duplicate. The transformation rules would then handle X duplicates of themselves by looking at the PREVIOUS letter. X X This solves the problems of "TTH" where you want the "th" X sound. X X 10. Change "CC" so that the metaphone character is "C", they X way it is now for McComb and such you get "MKK", which X unnecessiarly eats up and extra metaphone character. X X 11. "TH" at the beginning as in Thomas. The verbal was not X clear about this. I think is should be "T" and not "0" X so I am changing code. X X After the first test I think that "THvowel" should be X "0" and "TH(!vowel)" should be "T" X X 12. I think throwing away 1 "S" and the end would be good. X Since I am doing this anyway after the fact. If I X do it before then names like. .. X BURROUGHS & BURROUGH would be the same X because the GH would map to the same value in X both cases. X X 13. Case "Y", Brian and Bryan give different codes X Don't know how to handle this yet. X X 14. Comments on metaphone groups. Metaphone actually X makes groups bigger. Names like: X X C...R... G...R... K...R... Q...R... X X will map to "KR". Soundex would have produced for example X X C600,C620,G600,G620,K600,K620,Q600,Q620 X X the names from these 8 groups would have been collapsed into 1. X X Another way to look at this is for a more exact initial X guess of a name Soundex would give you a smaller list of X posibilities. If you don't know how to spell it at all X however, your success at finding the right match with X Metaphone is much greater than with Soundex. X X 15. After some tests decided to leave S's at the end of the X Metaphone. #12 takes care of my problems with plurals and X then S gets used to help make distinct metaphone. X X Lawrence Philips is no longer at the company indicated in the X article. So I was unable to verify these items. X*/ X X X#include X#include X#include X X#define TRUE (1) X#define FALSE (0) X#define NULLCHAR (char *) 0 X Xchar *VOWELS="AEIOU", X *FRONTV="EIY", /* special cases for letters in FRONT of these */ X *VARSON="CSPTG", /* variable sound--those modified by adding an "h" */ X *DOUBLE="."; /* let these double letters through */ X Xchar *excpPAIR="AGKPW", /* exceptions "ae-", "gn-", "kn-", "pn-", "wr-" */ X *nextLTR ="ENNNR"; Xchar *chrptr, *chrptr1; X Xvoid phonetic(name,metaph,metalen) Xchar *name, *metaph; Xint metalen; X{ X Xint ii, jj, silent, hard, Lng, lastChr; X Xchar curLtr, prevLtr, nextLtr, nextLtr2, nextLtr3; X Xint vowelAfter, vowelBefore, frontvAfter; X Xchar wname[60]; Xchar *ename=wname; X X jj = 0; X for (ii=0; name[ii] != '\0'; ii++) { X if ( isalpha(name[ii]) ) { X ename[jj] = toupper(name[ii]); X jj++; X } X } X ename[jj] = '\0'; X X if (strlen(ename) == 0) return; X X /* if ae, gn, kn, pn, wr then drop the first letter */ X if ( (chrptr=strchr(excpPAIR,ename[0]) ) != NULLCHAR ) { X chrptr1 = nextLTR + (chrptr-excpPAIR); X if ( *chrptr1 == ename[1] ) strcpy(ename,&ename[1]); X } X /* change x to s */ X if (ename[0] == 'X') ename[0] = 'S'; X /* get rid of the "h" in "wh" */ X if ( strncmp(ename,"WH",2) == 0 ) strcpy(&ename[1], &ename[2]); X X Lng = strlen(ename); X lastChr = Lng -1; /* index to last character in string makes code easier*/ X X /* Remove an S from the end of the string */ X if ( ename[lastChr] == 'S' ) { X ename[lastChr] = '\0'; X Lng = strlen(ename); X lastChr = Lng -1; X } X X for (ii=0; ( (strlen(metaph) < metalen) && (ii < Lng) ); ii++) { X X curLtr = ename[ii]; X X vowelBefore = FALSE; prevLtr = ' '; X if (ii > 0) { X prevLtr = ename[ii-1]; X if ( strchr(VOWELS,prevLtr) != NULLCHAR ) vowelBefore = TRUE; X } X /* if first letter is a vowel KEEP it */ X if (ii == 0 && (strchr(VOWELS,curLtr) != NULLCHAR) ) { X strncat(metaph,&curLtr,1); X continue; X } X X vowelAfter = FALSE; frontvAfter = FALSE; nextLtr = ' '; X if ( ii < lastChr ) { X nextLtr = ename[ii+1]; X if ( strchr(VOWELS,nextLtr) != NULLCHAR ) vowelAfter = TRUE; X if ( strchr(FRONTV,nextLtr) != NULLCHAR ) frontvAfter = TRUE; X } X /* skip double letters except ones in list */ X if (curLtr == nextLtr && (strchr(DOUBLE,nextLtr) == NULLCHAR) ) continue; X X nextLtr2 = ' '; X if (ii < (lastChr-1) ) nextLtr2 = ename[ii+2]; X X nextLtr3 = ' '; X if (ii < (lastChr-2) ) nextLtr3 = ename[ii+3]; X X switch (curLtr) { X X case 'B': silent = FALSE; X if (ii == lastChr && prevLtr == 'M') silent = TRUE; X if (! silent) strncat(metaph,&curLtr,1); X break; X X /*silent -sci-,-sce-,-scy-; sci-, etc OK*/ X case 'C': if (! (ii > 1 && prevLtr == 'S' && frontvAfter) ) X X if ( ii > 0 && nextLtr == 'I' && nextLtr2 == 'A' ) X strncat(metaph,"X",1); X else X if (frontvAfter) X strncat(metaph,"S",1); X else X if (ii > 1 && prevLtr == 'S' && nextLtr == 'H') X strncat(metaph,"K",1); X else X if (nextLtr == 'H') X if (ii == 0 && (strchr(VOWELS,nextLtr2) == NULLCHAR) ) X strncat(metaph,"K",1); X else X strncat(metaph,"X",1); X else X if (prevLtr == 'C') X strncat(metaph,"C",1); X else X strncat(metaph,"K",1); X break; X X case 'D': if (nextLtr == 'G' && (strchr(FRONTV,nextLtr2) != NULLCHAR)) X strncat(metaph,"J",1); X else X strncat(metaph,"T",1); X break; X X case 'G': silent=FALSE; X /* SILENT -gh- except for -gh and no vowel after h */ X if ( (ii < (lastChr-1) && nextLtr == 'H') X && (strchr(VOWELS,nextLtr2) == NULLCHAR) ) X silent=TRUE; X X if ( (ii == (lastChr-3) ) X && nextLtr == 'N' && nextLtr2 == 'E' && nextLtr3 == 'D') X silent=TRUE; X else X if ( (ii == (lastChr-1)) && nextLtr == 'N') silent=TRUE; X X if (prevLtr == 'D' && frontvAfter) silent=TRUE; X X if (prevLtr == 'G') X hard=TRUE; X else X hard=FALSE; X X if (!silent) X if (frontvAfter && (! hard) ) X strncat(metaph,"J",1); X else X strncat(metaph,"K",1); X break; X X case 'H': silent = FALSE; X if ( strchr(VARSON,prevLtr) != NULLCHAR ) silent = TRUE; X X if ( vowelBefore && !vowelAfter) silent = TRUE; X X if (!silent) strncat(metaph,&curLtr,1); X break; X X case 'F': X case 'J': X case 'L': X case 'M': X case 'N': X case 'R': strncat(metaph,&curLtr,1); X break; X X case 'K': if (prevLtr != 'C') strncat(metaph,&curLtr,1); X break; X X case 'P': if (nextLtr == 'H') X strncat(metaph,"F",1); X else X strncat(metaph,"P",1); X break; X X case 'Q': strncat(metaph,"K",1); X break; X X case 'S': if (ii > 1 && nextLtr == 'I' X && ( nextLtr2 == 'O' || nextLtr2 == 'A') ) X strncat(metaph,"X",1); X else X if (nextLtr == 'H') X strncat(metaph,"X",1); X else X strncat(metaph,"S",1); X break; X X case 'T': if (ii > 1 && nextLtr == 'I' X && ( nextLtr2 == 'O' || nextLtr2 == 'A') ) X strncat(metaph,"X",1); X else X if (nextLtr == 'H') /* The=0, Tho=T, Withrow=0 */ X if (ii > 0 || (strchr(VOWELS,nextLtr2) != NULLCHAR) ) X strncat(metaph,"0",1); X else X strncat(metaph,"T",1); X else X if (! (ii < (lastChr-2) && nextLtr == 'C' && nextLtr2 == 'H')) X strncat(metaph,"T",1); X break; X X case 'V': strncat(metaph,"F",1); X break; X X case 'W': X case 'Y': if (ii < lastChr && vowelAfter) strncat(metaph,&curLtr,1); X break; X X case 'X': strncat(metaph,"KS",2); X break; X X case 'Z': strncat(metaph,"S",1); X break; X } X X } X X/* DON'T DO THIS NOW, REMOVING "S" IN BEGINNING HAS the same effect X with plurals, in addition imbedded S's in the Metaphone are included X Lng = strlen(metaph); X lastChr = Lng -1; X if ( metaph[lastChr] == 'S' && Lng >= 3 ) metaph[lastChr] = '\0'; X*/ X X return; X} X Xint metaphone(argc) Xint argc; X{ X X char name[128]; X char metaph[50]; X X if (argc != 1) { X fprintf(stderr, "metaphone: argc != 1\n"); X retquote(""); X return(1); X } X X popquote(name, sizeof(name)); X X metaph[0] = '\0'; X phonetic(name,metaph,20); X X retquote(metaph); X return(1); X} X X/*********** BEGIN BASIC CODE from Computer Language, Dec 1990 ************ X* XSUBROUTINE METAPHONE(NAME, METAPH) X XEQU VOWELS TO "AEIOU" XEQU FRONTV TO "EIY" XEQU VARSON TO "CSPTG" X* As in variable sound--those modified by adding an "h" XNAME = ICONV(NAME, 'MCU') XENAME = ICONV(NAME, 'MCA') X* Delete nonalphanumeric characters and make all caps XIF ENAME = "" THEN RETURN XTWO = ENAME[1,2] XIF TWO = "PN" OR TWO = "AE" OR TWO = "KN" OR TWO = "GN" OR TWO = "WR" THEN ENAME = ENAME[2,9999] XIF ENAME[1,1] = "X" THEN ENAME = "S":ENAME[2,9999] XIF TWO = "WH" THEN ENAME = "W":ENAME[3,9999] X* Convert to metaph XL = LEN(ENAME) XMETAPH = ''; NEW = 1; HARD = 0 XFOR N = 1 TO L WHILE LEN(METAPH) < 4 X SYMB = ENAME[N,1] X IF SYMB # "C" AND N > 1 AND ENAME[N - 1,1] = SYMB THEN NEW = 0 ELSE NEW = 1 X IF NEW = 1 THEN X BEGIN CASE X CASE INDEX(VOWELS,SYMB,1) > 0 AND N = 1 X METAPH = SYMB X X CASE SYMB = "B" X IF N = L AND ENAME[N - 1,1] = "M" THEN SILENT = 1 ELSE SILENT = 0 X IF NOT(SILENT) THEN METAPH = METAPH:SYMB X X CASE SYMB = "C" X IF NOT(N > 1 AND ENAME[N - 1,1] = "S" AND (N + 1) <= L AND INDEX(FRONTV,ENAME[N + 1],1) > 0) THEN X IF (N + 2) <= L AND ENAME[N + 1,1] = "I" AND ENAME[N + 2,1] = "A" THEN X METAPH = METAPH:"X" X END ELSE X IF N < L AND INDEX(FRONTV,ENAME[N + 1,1],1) > 0 THEN X METAPH = METAPH:"S" X END ELSE X IF N > 1 AND N < L AND ENAME[N + 1,1] = "H" AND ENAME[N - 1,1] = "S" THEN X METAPH = METAPH:"K" X END ELSE X IF N < L AND ENAME[N + 1,1] = "H" THEN X IF N = 1 AND (N + 2) <= L AND INDEX(VOWELS,ENAME[N + 2,1],1) = 0 THEN X METAPH = METAPH:"K" X END ELSE X METAPH = METAPH:"X" X END X END ELSE X METAPH = METAPH:"K" X END X END X END X END X END X X CASE SYMB = "D" X IF (N + 2) <= L AND ENAME[N + 1,1] = "G" AND INDEX(FRONTV,ENAME[N + 2,1],1) > 0 THEN X METAPH = METAPH:"J" X END ELSE X METAPH = METAPH:"T" X END X X CASE SYMB = "G" X IF N < L AND ENAME[N + 1,1] = "H" AND INDEX(VOWELS,ENAME[N + 2,1],1) = 0 THEN SILENT = 1 ELSE SILENT = 0 X IF N > 1 AND ((N + 1) = L OR (ENAME[N + 1,1] = "N" AND ENAME[N + 2,1] = "E" AND ENAME[N + 3,1] = "D" AND (N + 3) = L)) AND ENAM X X IF N > 1 AND (N + 1) <= L AND ENAME[N - 1,1] = "D" AND INDEX(FRONTV,ENAME[N + 1,1],1) > 0 THEN SILENT = 1 X IF N > 1 AND ENAME[N - 1,1] = "G" THEN HARD = 1 ELSE HARD = 0 X IF NOT(SILENT) THEN X IF N < L AND INDEX(FRONTV,ENAME[N + 1,1],1) > 0 AND NOT(HARD) THEN X METAPH = METAPH:"J" X END ELSE X METAPH = METAPH:"K" X END X END X X CASE SYMB = "H" X IF NOT(N = L OR (N > 1 AND INDEX(VARSON,ENAME[N - 1,1],1) > 0)) THEN X IF INDEX(VOWELS,ENAME[N + 1,1],1) > 0 THEN X METAPH = METAPH:"H" X END X END X X CASE SYMB = "F" OR SYMB = "J" OR SYMB = "L" OR SYMB = "M" OR SYMB = "N" OR SYMB = "R" X METAPH = METAPH:SYMB X X CASE SYMB = "K" X IF N > 1 AND ENAME[N - 1,1] # "C" THEN X METAPH = METAPH:"K" X END ELSE X IF N = 1 THEN X METAPH = "K" X END X END X X CASE SYMB = "P" X IF N < L AND ENAME[N + 1,1] = "H" THEN X METAPH = METAPH:"F" X END ELSE X METAPH = METAPH:"P" X END X X CASE SYMB = "Q" X METAPH = METAPH:"K" X X CASE SYMB = "S" X IF N > 1 AND (N + 2) <= L AND ENAME[N + 1,1] = "I" AND (ENAME[N + 2,1] = "O" OR ENAME[N + 2,1] = "A" ) THEN X METAPH = METAPH:"X" X END ELSE X IF N < L AND ENAME[N + 1,1] = "H" THEN X METAPH = METAPH:"X" X END ELSE X METAPH = METAPH:"S" X END X END X X CASE SYMB = "T" X IF N > 1 AND (N + 2) <= L AND ENAME[N + 1,1] = "I" AND (ENAME[N + 2,1] = "O" OR ENAME[N + 2,1] = "A" THEN X METAPH = METAPH:"X" X END ELSE X IF N < L AND ENAME[N + 1,1] = "H" THEN X IF NOT(N > 1 AND ENAME[N - 1,1] = "T" THEN X METAPH = METAPH:"0" X END X END ELSE X IF NOT(ENAME[N + 1,1] = "C" AND ENAME[N + 2,1] = "H") THEN X METAPH = METAPH:"T" X END X END X END X X CASE SYMB = "V" X METAPH = METAPH:"F" X X CASE SYMB = "W" OR SYMB = "Y" X IF N < L AND INDEX(VOWELS, ENAME[N + 1,1],1) > 0 THEN METAPH=METAPH:SYMB X X CASE SYMB = "X" X METAPH = METAPH:"KS" X X CASE SYMB = "Z" X METAPH = METAPH:"S" X END CASE X END XNEXT N X* XRETURN X* XEND X*********** END OF BASIC CODE from Computer Language, Dec 1990 ************ */ X SHAR_EOF if [ `wc -c < metaphone.mk1` -ne 25532 ] then echo "Lengths do not match -- Bad Copy of metaphone.mk1" fi echo "Extracting file metaphone.mk2" sed -e 's/^X//' <<\SHAR_EOF > metaphone.mk2 XFrom: mkuhn@rhlab.UUCP (Michael Kuhn) XNewsgroups: comp.databases.informix XSubject: Phonetic search codes XDate: 29 Nov 95 16:04:22 GMT X XThe following describes a replacement for the traditional "Soundex" coding Xused to create a "phonetic" string for persons names. X XI have produced a C version of this routine and sent it for inclusion Xin the Informix archive. You can find it there. X XOver the last 3 weeks I have looked at the differences between using X"Soundex" over "Metaphone" for retrieval of possible matches of persons Xin the database. X XI used 20,000+ unique last names and found that for the most part XMetaphone does group together names that Soundex would not. The primary Xreason for this is that Soundex ALWAYS uses the first letter of the Xname as part of the code. X XHowever, with Metaphone you find MANY instances where the following Xoccurs: X FALLIS & VALLIS would have DIFFERENT Soundex X KLINE & CLINE SAME Metaphone X XI personally have not implemented it yet because. . . X X KLINE & CLINE in the same "group" means that there are many, many, X many, many more names that are going to be retrieved for each X target group that you are going after. X XMy paticular problem is to find matching names that are SOMETIMES Xmispelled slightly. The input persons usually have the first letter Xof the last name. Its the stuff in the middle that is screwed up. X XBasically, I am looking for a "fuzzy filter". A couple of years Xago in Byte magazine and Unix Review there were articles on "agrep" X(approximate grep) that used some new algorithms and expanded adapations Xof old algorithms. X XI took collections of names with the same metaphone groups and put them Xin an ascii file and used agrep to search for names by misspelling them, etc. X XThe results where "stunning" in my opinion. As an example 58 last names Xfrom a metaphone "group" where reduced to a set of 14 BEST GUESSes by agrep. X XI looked at the C code for agrep. In order to get the basic "fuzzy filter" Xroutine to call from 4GL would require someone to get really familiar with Xthese algorithms. More time than I have right now. X XWhile y'all are driving around on the net keep a lookout for something Xlike this, that can be imbedded in 4GL. X XThanks. X X X Metaphone Algorithm X X Created by Lawrence Philips (location unknown). Metaphone presented X in article in "Computer Language" December 1990 issue. X X *********** BEGIN METAPHONE RULES *********** X X Lawrence Philips' RULES follow: X X The 16 consonant sounds: X |--- ZERO represents "th" X | X B X S K J T F H L M N P R 0 W Y X X Exceptions: X X Beginning of word: "ae-", "gn", "kn-", "pn-", "wr-" ----> drop first letter X "Aebersold", "Gnagy", "Knuth", "Pniewski", "Wright" X X Beginning of word: "x" ----> change to "s" X as in "Deng Xiaopeng" X X Beginning of word: "wh-" ----> change to "w" X as in "Whalen" X X Transformations: X X B ----> B unless at the end of word after "m", as in "dumb", "McComb" X X C ----> X (sh) if "-cia-" or "-ch-" X S if "-ci-", "-ce-", or "-cy-" X SILENT if "-sci-", "-sce-", or "-scy-" X K otherwise, including in "-sch-" X X D ----> J if in "-dge-", "-dgy-", or "-dgi-" X T otherwise X X F ----> F X X G ----> SILENT if in "-gh-" and not at end or before a vowel X in "-gn" or "-gned" X in "-dge-" etc., as in above rule X J if before "i", or "e", or "y" if not double "gg" X K otherwise X X H ----> SILENT if after vowel and no vowel follows X or after "-ch-", "-sh-", "-ph-", "-th-", "-gh-" X H otherwise X X J ----> J X X K ----> SILENT if after "c" X K otherwise X X L ----> L X X M ----> M X X N ----> N X X P ----> F if before "h" X P otherwise X X Q ----> K X X R ----> R X X S ----> X (sh) if before "h" or in "-sio-" or "-sia-" X S otherwise X X T ----> X (sh) if "-tia-" or "-tio-" X 0 (th) if before "h" X silent if in "-tch-" X T otherwise X X V ----> F X X W ----> SILENT if not followed by a vowel X W if followed by a vowel X X X ----> KS X X Y ----> SILENT if not followed by a vowel X Y if followed by a vowel X X Z ----> S X-- XMichael J. Kuhn Computer Systems Consultant phone:410-254-7060 XEmail: mkuhn@csd.clark.net X mkuhn@rhlab.com or mkuhn%rhlab@uunet.uu.net or uunet!rhlab!mkuhn X c/o Baltimore Rh Typing Laboratory, Inc. phone:410-225-9595 X SHAR_EOF if [ `wc -c < metaphone.mk2` -ne 4898 ] then echo "Lengths do not match -- Bad Copy of metaphone.mk2" fi echo "Extracting file metaphone.rm" sed -e 's/^X//' <<\SHAR_EOF > metaphone.rm XFrom: Robert Minter XSubject: Re: Metaphone XTo: Walt Hultgren {rmy} XDate: Fri, 17 Nov 1995 14:06:40 -0800 (PST) X X* Michael Kuhn and I have been discussing his Metaphone 4GL code. He is going X* to place a copy in the FTP archive that I maintain on mathcs.emory.edu. He X* mentioned that you had sent him some of your code that implements the same X* algorithm. I would like to have a copy of your code for the archive also X* if you don't mind. If you would be willing to do that, please let me know. X XSure, however, it is a (modified) version of the Metaphone routine. It works Xthe basically the same as the original routine, though. Much simplier. X X --v-- SNIP --v-- SNIP --v-- SNIP --v-- SNIP --v-- SNIP --v-- SNIP --v-- X#------------------------------------------------------------------------------ X# Module: meta_ph.4gl X# X# Hacked version of metaphone routine X#------------------------------------------------------------------------------ X XDEFINE X meta_len SMALLINT, X org_name CHAR(50) X X#---------------------------------------------------------------------------- X# Metaphone() - Public function X#---------------------------------------------------------------------------- XFUNCTION Metaphone(pass_name) X XDEFINE X pass_name CHAR(50), X idx SMALLINT, X new_char SMALLINT, X ret_name CHAR(50) X X -- Initialization X LET pass_name = UPSHIFT(pass_name) X LET org_name = "" X X -- Parse out unwanted's X FOR idx = 1 TO LENGTH(pass_name) X IF pass_name[idx] MATCHES "[A-Za-z0-9]" THEN X LET org_name = org_name CLIPPED, pass_name[idx] X END IF X END FOR X X -- If no length, return X IF NOT LENGTH(org_name) THEN X RETURN "" X END IF X X -- More initialization X LET meta_len = LENGTH(org_name) X LET ret_name = org_name[1] X X -- Main loop to generate metaphone X FOR idx = 2 TO meta_len X IF org_name[idx] NOT MATCHES "[0-9]" THEN X LET ret_name = ret_name CLIPPED, DoLogic(idx) X END IF X END FOR X X RETURN ret_name X XEND FUNCTION X X#------------------------------------------------------------------------------ X# do_logic - Private to module X#------------------------------------------------------------------------------ XFUNCTION DoLogic(this_idx) X XDEFINE X this_idx SMALLINT, X this_char CHAR, X hard SMALLINT X X LET this_char = org_name[this_idx] X X CASE X WHEN this_char = "B" X IF this_idx = meta_len THEN X IF org_name[this_idx - 1] != "M" THEN X RETURN this_char X END IF X ELSE X RETURN this_char X END IF X X WHEN this_char = "C" X IF this_idx + 1 <= meta_len THEN X IF org_name[this_idx - 1] = "S" AND X org_name[this_idx + 1] MATCHES "[EIY]" THEN X RETURN "" X END IF X END IF X IF this_idx + 2 <= meta_len THEN X IF org_name[this_idx + 1] = "I" and X org_name[this_idx + 2] = "A" then X RETURN "X" X END IF X END IF X IF this_idx < meta_len THEN X IF org_name[this_idx + 1] MATCHES "[EIY]" THEN X RETURN "S" X END IF X IF org_name[this_idx + 1] = "H" AND X org_name[this_idx - 1] = "S" THEN X RETURN "K" X END IF X IF org_name[this_idx + 1] = "H" THEN X IF this_idx + 2 <= meta_len THEN X IF org_name[this_idx + 2] MATCHES "[AEIOU]" THEN X RETURN "K" X ELSE X RETURN "X" X END IF X ELSE X RETURN "X" X END IF X END IF X END IF X RETURN "K" X X WHEN this_char = "D" X IF this_idx + 2 <= meta_len THEN X IF org_name[this_idx + 1] = "G" AND X org_name[this_idx + 2] MATCHES "[EIY]" THEN X RETURN "J" X END IF X END IF X RETURN "T" X X WHEN this_char = "G" X IF this_idx + 2 <= meta_len THEN X IF org_name[this_idx + 1] = "H" AND X org_name[this_idx + 2] MATCHES "[AEIOUT]" THEN X RETURN "" X END IF X END IF X IF this_idx + 1 = meta_len THEN X IF org_name[this_idx + 1] = "N" THEN X RETURN "" X END IF X END IF X IF this_idx + 3 = meta_len THEN X IF org_name[this_idx + 1] = "N" AND X org_name[this_idx + 2] = "E" AND X org_name[this_idx + 3] = "D" THEN X RETURN "" X END IF X END IF X IF this_idx + 1 <= meta_len THEN X IF org_name[this_idx - 1] = "D" AND X org_name[this_idx + 1] MATCHES "[EIY]" THEN X RETURN "" X END IF X END IF X IF this_idx < meta_len THEN X IF org_name[this_idx + 1] MATCHES "[EIY]" THEN X RETURN "J" X END IF X END IF X RETURN "K" X X WHEN this_char = "H" X IF this_idx = meta_len THEN X RETURN "" X END IF X IF org_name[this_idx - 1] MATCHES "[CSPTG]" THEN X RETURN "" X END IF X IF this_idx + 1 <= meta_len THEN X IF org_name[this_idx + 1] MATCHES "[AEIOU]" THEN X RETURN this_char X END IF X END IF X X WHEN this_char = "K" X IF org_name[this_idx - 1] NOT MATCHES "[0-9]" THEN X RETURN this_char X END IF X X WHEN this_char = "P" X IF this_idx < meta_len THEN X IF org_name[this_idx + 1] = "H" THEN X RETURN "F" X END IF X END IF X RETURN this_char X X WHEN this_char = "Q" X RETURN "K" X X WHEN this_char = "S" X IF this_idx + 2 <= meta_len THEN X IF org_name[this_idx + 1] = "I" AND X org_name[this_idx + 2] MATCHES "[AO]" THEN X RETURN "X" X END IF X END IF X IF this_idx < meta_len THEN X IF org_name[this_idx + 1] = "H" THEN X RETURN "X" X END IF X END IF X RETURN this_char X X WHEN this_char = "T" X IF this_idx + 2 <= meta_len THEN X IF org_name[this_idx + 1] = "I" AND X org_name[this_idx + 2] MATCHES "[AO]" THEN X RETURN "X" X END IF X IF org_name[this_idx + 1] = "C" AND X org_name[this_idx + 2] = "H" THEN X RETURN "" X END IF X END IF X IF this_idx < meta_len THEN X IF org_name[this_idx + 1] = "H" THEN X IF org_name[this_idx - 1] = "T" THEN X RETURN "" X ELSE X RETURN "O" X END IF X END IF X END IF X RETURN this_char X X WHEN this_char = "W" OR this_char = "Y" X IF this_idx < meta_len THEN X IF org_name[this_idx + 1] MATCHES "[AEIOU]" THEN X RETURN this_char X END IF X END IF X X OTHERWISE X IF this_char MATCHES "[A-Z]" AND X this_char NOT MATCHES "[AEIOU]" THEN X RETURN this_char X END IF X END CASE X X RETURN "" X XEND FUNCTION X --^-- SNIP --^-- SNIP --^-- SNIP --^-- SNIP --^-- SNIP --^-- SNIP --^-- X X Robert Minter Data Systems Support \\\_/// XSenior Software Engineer A Client Technologies Company ( _ _ ) XE-Mail: rob@dssmktg.com Tel: 714.771.0454 (| ^ |) X#include Fax: 714.771.3028 \`-'/ XDe Colores - Emmaus OC-13 SURF'S UP \_/ X SHAR_EOF if [ `wc -c < metaphone.rm` -ne 8559 ] then echo "Lengths do not match -- Bad Copy of metaphone.rm" fi echo "Extracting file metaphone.sf" sed -e 's/^X//' <<\SHAR_EOF > metaphone.sf XFrom: SFidai@aol.com XSubject: Phonetic searching algori... XTo: walt@mathcs.emory.edu XDate: Tue, 7 Feb 1995 17:48:48 -0500 X XHi guys, X XI apologize for the delay. I called a collegue at my previous Xjob to get me that article so that I can properly credit. But XI can't wait forever. As soon as I come to know the author and Xthe journal/issue where this article was appeared, I will let XWalt Hultgren know so that I can update PD archives. X XI ported this algorithm from an article where it was in XPICK/BASIC. I simply ported that to C. I tried to keep Xall the veriable names same. By the way, I do not program Xin C regularly, if you find any silly stuff please correct it. X XI think it uses pretty generic stuff, but to make it clear, I Xhave only tested this on IBM RS6000 AIX 3.X. X XThe program first accepts # of chars to be output as a metaphone Xcode. After that it goes into loop where it accepts name and Xoutputs metaphone code. The author, who if I remember it correctly Xis an AI scientist, claims that this algorithm is much more Xsuperior to soundex. One of the drawback of the soundex is Xthat you have to know the first char of the name. With metaphone Xit is not like that. It strickly goes by how it sounds. X XI think the logic is streightforward enough that it can be Xwritten in Informix/4GL easily. May be someone would volinteer. X XSadru Fidai XMunics Information Systems X50 Mount Prospect Ave XClifton NJ 07013 X(201)778-7753 X X XFrom: SFidai@aol.com XDate: Tue, 21 Feb 1995 17:43:34 -0500 XSubject: Metaphone X XHi guys, X XGuess what? I found the article. Not from my previous job but Xfrom internet. I posted a message on usenet group on pick and Xa fellow from Australia replied and he even faxed me that Xarticle. X XThe article appeared in the magazine called "Computer Language" XDecember 1990 issue and is written by Lawrence Philips. The Xauther is/was an artificial intelligence specialist at NAC Xreinsurance in Greenwich, Connecticut. X XPlease make a note of this. X XThanks, X XSadru Fidai X X X----------------------------------------Cut Here------------------ X/*************************************************************** X * X * Metaphone.c X * X * X * X * X * X * X **************************************************************/ X X X#include X#include X#include X#include X Xchar *VOWELS="AEIOU", X *FRONTV="EIY", X *VARSON="CSPTG"; X Xvoid metaphone(name,metaph,metalen) Xchar *name, *metaph; Xint metalen; X{ X Xint i,j,new, silent,L,hard; Xchar *ename, *tmpstr,symb; Xchar two[3]; X Xename=malloc(strlen(name) * sizeof(char)); X Xmemset(ename, '\0', strlen(name)); Xmemset(metaph, '\0', strlen(metaph)); X Xj=0; X Xfor (i=0;name[i]!='\0';i++) X{ X name[i]=toupper(name[i]); X X if (isalnum(name[i])) X { X ename[j]=name[i]; X j++; X } X} X Xename[j]='\0'; X Xif (strlen(ename) == 0) X{ X free(ename); X return; X} X X/* realloc(ename, strlen(ename), sizeof(char)); */ X Xstrncpy(two,ename,2); X Xif ((!strncmp(two,"PN",2)) || (!strncmp(two,"AE",2)) || X(!strncmp(two,"KN",2)) X || (!strncmp(two,"GN",2)) || (!strncmp(two,"WR",2)) || X(!strncmp(two,"AE",2))) X{ X strcpy(ename,&ename[1]); X} X X Xif (ename[0] == 'X') X{ X tmpstr=strdup(ename); X ename[0] = 'S'; X strcpy(&ename[1],&tmpstr[1]); X free(tmpstr); X} X Xif (!strncmp(two,"WH",2)) X{ X tmpstr=strdup(ename); X ename[0] = 'W'; X strcpy(&ename[1],&tmpstr[2]); X free(tmpstr); X} X XL=strlen(ename); X Xfor (i=0;((strlen(metaph) < metalen) && (i < L));i++) X{ X symb=ename[i]; X X if ((symb != 'C') && (i > 0) && (ename[i-1] == symb)) X new=0; X else X new=1; X X if (new == 1) X { X if ((strrchr(VOWELS,symb) != (char *)NULL) && (i == 0)) X metaph[0]=symb; X X else if ( symb == 'B') X { X if (( i == (L -1)) && (ename[i-1] == 'M')) X silent=1; X else X silent=0; X X if (!silent) X strncat(metaph,&symb,1); X } X else if ( symb == 'C') X { X if (!((i > 0) && (ename[i-1] == 'S') && ((i+1) <= (L-1)) X && (strrchr(FRONTV,ename[i+1]) != (char *)NULL))) X { X if (((i+2) <= (L-1)) && (ename[i+1] == 'I') && (ename[i+2] == X'A')) X strncat(metaph,"X",1); X else X { X if ((i < (L-1)) && (strrchr(FRONTV,ename[i+1]) != (char X*)NULL)) X strncat(metaph,"S",1); X else X if ((i >0) && (i < (L-1)) && (ename[i+1] == 'H') X && (ename[i-1] == 'S')) X strncat(metaph,"K",1); X else X if ((i < (L-1)) && (ename[i+1] == 'H')) X if ((i == 0) && ((i+2) <= (L-1)) X && (strrchr(VOWELS,ename[i+2]) == (char *)NULL)) X strncat(metaph,"K",1); X else X strncat(metaph,"X",1); X else X strncat(metaph,"K",1); X } X } X } X else if ( symb == 'D') X { X if (((i+2) <= (L-1)) && (ename[i+1] == 'G') X && (strrchr(FRONTV,ename[i+2]) != (char *)NULL)) X strncat(metaph,"J",1); X else X strncat(metaph,"T",1); X } X else if ( symb == 'G') X { X if ((i < (L-1)) && ( ename[i+1] == 'H') X && (strrchr(VOWELS,ename[i+2]) == (char *)NULL)) X silent=1; X else X silent=0; X X if ((i > 0) X && (((i+1) == (L-1)) || ((ename[i+1] == 'N') X && (ename[i+2] == 'E') X && (ename[i+3] == 'D') X && ((i+3) == (L-1)))) X && (ename[i+1] == 'N')) X silent=1; X X if ((i > 0) && ((i+1) <= (L-1)) && (ename[i-1] == 'D') X && (strrchr(FRONTV,ename[i+1]) != (char *)NULL)) X silent=1; X X if ((i > 0) && (ename[i-1] == 'G')) X hard=1; X else X hard=0; X X if (!silent) X if ((i < (L-1)) X && (strrchr(FRONTV,ename[i+1]) != (char *)NULL) X && (!hard)) X strncat(metaph,"J",1); X else X strncat(metaph,"K",1); X } X else if ( symb == 'H') X { X if (!(( i == (L-1)) || ((i > 0) X && (strrchr(VARSON,ename[i-1]) != (char X*)NULL)))) X if (strrchr(VOWELS,ename[i+1]) != (char *)NULL) X strncat(metaph,"H",1); X } X else if ((symb == 'F') X || (symb == 'J') X || (symb == 'L') X || (symb == 'M') X || (symb == 'N') X || (symb == 'R')) X { X strncat(metaph,&symb,1); X } X else if ( symb == 'K') X { X if ((i > 0) && (ename[i-1] != 'C')) X strncat(metaph,"K",1); X else X if (i == 0) X metaph[0]='K'; X } X X else if ( symb == 'P') X { X if ((i < (L-1)) && (ename[i+1] == 'H')) X strncat(metaph,"F",1); X else X strncat(metaph,"P",1); X } X else if ( symb == 'Q') X { X strncat(metaph,"K",1); X } X else if ( symb == 'S') X { X /** X if ((i > 0) && ((i+2) <= (L-1)) && (ename[i+1] == 'I') X && (ename[i+2] == 'O') || (ename[i+2] == 'A')) X **/ X if ((i >= 0) && ((i+2) <= (L-1)) && (((ename[i+1] == 'I') X && (ename[i+2] == 'O')) || (ename[i+2] == 'A'))) X strncat(metaph,"X",1); X else X if (( i < (L-1)) && (ename[i+1] == 'H')) X strncat(metaph,"X",1); X else X strncat(metaph,"S",1); X } X else if ( symb == 'T') X { X /***** X if ((i > 0) && ((i+2) <= (L-1)) && (ename[i+1] == 'I') X && ((ename[i+2] == 'O') || (ename[i+2] == 'A'))) X X if ((i > 0) && ((i+2) <= (L-1)) && (((ename[i+1] == 'I') X && (ename[i+2] == 'O')) || (ename[i+2] == 'A'))) X strncat(metaph,"X",1); X if ((i > 0) && ((i+2) < (L-1)) && (((ename[i+1] == 'I') X && (ename[i+2] == 'O')) || (ename[i+2] == 'A'))) X strncat(metaph,"X",1); X *****/ X if ((i >= 0) && ((i+2) <= (L-1)) && (((ename[i+1] == 'I') X && (ename[i+2] == 'O')) || (ename[i+2] == 'A'))) X strncat(metaph,"X",1); X else X if (( i < (L-1)) && (ename[i+1] == 'H')) X { X if (!((i > 0) && (ename[i-1] == 'T'))) X strncat(metaph,"O",1); X } X else if (!((ename[i+1] == 'C') && (ename[i+2] == 'H'))) X strncat(metaph,"T",1); X } X else if ( symb == 'V') X { X strncat(metaph,"F",1); X } X else if ((symb == 'W') || (symb == 'Y')) X { X if (( i < (L-1)) && (strrchr(VOWELS,ename[i+1]) != (char *)NULL)) X strncat(metaph,&symb,1); X } X else if ( symb == 'X') X { X if (strlen(metaph) < (metalen-1)) X strncat(metaph,"KS",2); X else X strncat(metaph,"K",1); X } X else if ( symb == 'Z') X { X strncat(metaph,"S",1); X } X } X} Xfree(ename); Xreturn; X X} X X Xmain(argc,argv) X X{ X X char name[30], metaph[8]; X int metalen=6; X X printf("Please enter metaphone size : "); X scanf("%d",&metalen); X fflush(stdin); X X if (metalen==0) X exit(1); X Xloop: X memset(metaph,'\0',sizeof(metaph)); X memset(name,'\0',sizeof(name)); X X printf("Please enter name: "); X gets(name); X X if (!strcmp(name,"")) X exit(0); X X metaphone(name,metaph,metalen); X Xprintf( "Metaphone code is : %-.*s\n", metalen, metaph); X /* printf( "Metaphone code is : %s\n", metaph);*/ X/* X printf("%s\n", metaph); X*/ X X goto loop; X X} X SHAR_EOF if [ `wc -c < metaphone.sf` -ne 9588 ] then echo "Lengths do not match -- Bad Copy of metaphone.sf" fi echo "Done." exit 0