
What is a Soundex code?

Soundex is a hashing system for English words. From an English word, you generate a letter and three numbers that roughly describe how the word sounds. Similar-sounding words will have similar codes. It might be used, for example, by 411 (phone information) to look up other spellings of a last name. It was used by the United States Census Bureau to find similar names in census records. Soundex was created by Robert C. Russell of Pittsburgh, Pennsylvania. He received U.S. patent 1,261,167 for it on April 2, 1918. The U.S. Patent and Trademark Office has the original Soundex patent (1,261,167) online. You might be interested in a history of the different versions of the Soundex coding system.

You can see Soundex in action here.

A bit of warning about Soundex codes: although in theory they should always be the same for a given name, in practice they sometimes vary. There are a number of reasons. Sometimes implementations of the algorithm have bugs that only become apparent in a small number of cases. (I've seen a number of implementations with bugs.) Sometimes last names are entered into the system incorrectly (various computer systems think my last name is "Smet", "De", or "Desmet", which map to S530, D000, and D253 respectively). In addition, the Soundex system is strongly English-oriented. There is no support for characters beyond the 26 letters used in the English language. As a result, names with unusual letters (like æ, ø, or Ð) are sometimes encoded in different ways by different people and programs.

Are you considering using Soundex for anything important? You might want to think again. Soundex is actually a pretty poor algorithm for doing fuzzy name comparisons. The specification has always been a bit fuzzy, so a single name might have different encodings depending on who did the encoding. You might want to look at "Considering a Soundex-based Solution for an Important Application?" with its "10 major problems with Soundex and other key-based name match solutions." You might also want to look at "Cracking the Soundex Code", which lists some of the problems with using Soundex to search for genealogy records.

The first letter of the code is simply the first letter in the word. The remaining numbers range from 1 to 6, indicating different categories of sounds created by the consonants following the first letter. If the word is too short to generate 3 numbers, 0 is added as needed. If the generated code is longer than 3 numbers, the extra numbers are thrown away.

Code  Letters                 Description
1     B, F, P, V              Labial
2     C, G, J, K, Q, S, X, Z  Gutturals and sibilants
3     D, T                    Dental
4     L                       Long liquid
5     M, N                    Nasal
6     R                       Short liquid
SKIP  A, E, H, I, O, U, W, Y  Vowels (and H, W, and Y) are skipped
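
As a rough illustration, the basic rules and the table above can be sketched in Python. This is only a minimal sketch: the names GROUPS, DIGIT, and basic_code are made up for this example, it deliberately ignores any special cases, and it assumes that anything outside A to Z is simply dropped.

```python
# Letter groups copied from the table above; every other letter is skipped.
GROUPS = {
    "1": "BFPV",
    "2": "CGJKQSXZ",
    "3": "DT",
    "4": "L",
    "5": "MN",
    "6": "R",
}
# Invert the groups into a letter -> digit lookup.
DIGIT = {letter: digit for digit, letters in GROUPS.items() for letter in letters}


def basic_code(word):
    """Hypothetical helper: first letter plus three digits, no special cases."""
    # Assumption: characters outside A-Z are dropped before encoding.
    letters = [ch for ch in word.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    # Encode every letter after the first that has a digit in the table.
    digits = [DIGIT[ch] for ch in letters[1:] if ch in DIGIT]
    # Pad with zeros, then keep only the first three digits.
    return letters[0] + "".join(digits + ["0", "0", "0"])[:3]


print(basic_code("Washington"))
print(basic_code("Wu"))
```

This naive version is enough for simple names, but it encodes every consonant separately, so a name like Jackson comes out as J222. That is the sort of result the special cases are meant to correct.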

There are several special cases when calculating a soundex code:

Sample Soundex Codes

Word        Soundex
Washington  W252
Wu          W000
DeSmet      D253
Gutierrez   G362
Pfister     P236
Jackson     J250
Tymczak     T522
Ashcraft    A261
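
The sample codes above can be reproduced with a short Python sketch. It is not the program described on this page. It assumes the special cases as they are commonly described for the National Archives style of Soundex: adjacent letters with the same number are coded once, the first letter's own number counts for this purpose, H and W do not separate two letters with the same number, and a vowel (or Y) does. The names CODES and soundex are chosen just for this example, and characters outside A to Z are assumed to be dropped.

```python
# Letter -> digit lookup built from the six consonant groups.
CODES = {}
for digit, letters in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for letter in letters:
        CODES[letter] = str(digit)


def soundex(word):
    """Sketch of a Soundex encoder under the assumptions stated above."""
    # Assumption: anything that is not A-Z (spaces, accents, æ, ø) is dropped.
    letters = [ch for ch in word.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    # The first letter's own digit counts when checking for repeats.
    previous = CODES.get(first)
    for ch in letters[1:]:
        if ch in "HW":
            # Assumed rule: H and W are ignored and do not break up a repeat.
            continue
        code = CODES.get(ch)  # None for A, E, I, O, U, Y
        if code is not None and code != previous:
            digits.append(code)
        # A vowel sets previous to None, so the same digit may be coded again.
        previous = code
    # Pad with zeros, then keep only the first three digits.
    return first + "".join(digits + ["0"] * 3)[:3]


for name in ["Washington", "Wu", "DeSmet", "Gutierrez",
             "Pfister", "Jackson", "Tymczak", "Ashcraft"]:
    print(name, soundex(name))
```

Pfister, Tymczak, and Ashcraft are the interesting samples here: each one only comes out right when the corresponding assumed rule is applied. Other implementations that treat H, W, or Y differently will disagree on names like these.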

By taking a Soundex code and filling in common letters, you can guess at the sound of the word. By comparing the code to a list of known Soundex codes, you can guess at common words. My program does both of these.
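
To give a feel for how such guessing might work, here is a small Python sketch. It is not the program mentioned above, only one possible approach. The KNOWN list reuses the names and codes already given on this page, the choice of one "common" letter per number in COMMON is an arbitrary assumption, and the names build_index, guess_words, and guess_sound are made up for the example.

```python
from collections import defaultdict

# Words and codes taken from this page (sample table and name variants).
KNOWN = {
    "Washington": "W252", "Wu": "W000", "DeSmet": "D253", "Gutierrez": "G362",
    "Pfister": "P236", "Jackson": "J250", "Tymczak": "T522", "Ashcraft": "A261",
    "Smet": "S530", "De": "D000", "Desmet": "D253",
}

# Assumption: one representative letter per digit; other choices work too.
COMMON = {"1": "B", "2": "S", "3": "T", "4": "L", "5": "N", "6": "R"}


def build_index(known):
    """Group the known words by their Soundex code."""
    index = defaultdict(list)
    for word, code in known.items():
        index[code].append(word)
    return index


def guess_words(code, index):
    """Return the known words that share the given code, sorted."""
    return sorted(index.get(code.upper(), []))


def guess_sound(code):
    """Replace each digit with a representative consonant; zeros are dropped."""
    return code[0].upper() + "".join(COMMON[d] for d in code[1:] if d in COMMON)


index = build_index(KNOWN)
print(guess_words("D253", index))
print(guess_sound("W252"))
```

The consonant skeleton from guess_sound has no vowels, so it is only a hint at the sound. The lookup is more useful in practice, and it also shows the weakness: every word sharing a code is an equally good guess.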

Source Code


Additional details on the Soundex system came from "The Soundex Machine" at the National Archives and Records Administration.
