Searching text using the SoundEx algorithm

Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. It can be a useful way to index names for searching so that similar-sounding names can be found even when they are spelled differently. As described by Wikipedia:

The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling. The algorithm mainly encodes consonants; a vowel will not be encoded unless it is the first letter. Soundex is the most widely known of all phonetic algorithms (in part because it is a standard feature of popular database software, although not included with SQLiteDatabase).

SoundEx has these rules:

  1. Retain the first letter of the name and drop all other occurrences of a, e, i, o, u, y, h, w.

  2. Replace consonants with digits as follows (after the first letter):

    • b, f, p, v → 1

    • c, g, j, k, q, s, x, z → 2

    • d, t → 3

    • l → 4

    • m, n → 5

    • r → 6

  3. If two or more letters with the same number are adjacent in the original name (before step 1), only retain the first letter; also, two letters with the same number separated by 'h' or 'w' are coded as a single number, whereas such letters separated by a vowel are coded twice. This rule also applies to the first letter.

  4. If the name has too few letters to assign three numbers, append zeros until there are three numbers. If there are more than three numbers, retain only the first three.
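
As an illustration, four of the names used in the examples later in this topic can be traced through these rules by hand. This is only a sketch of the reasoning, with letters shown in uppercase for readability:

    • Tymczak: keep T; M → 5; C → 2; Z has the same number as the adjacent C, so it is skipped; K → 2 is coded again because the vowel A separates it from Z. Result: T522.

    • Pfister: keep P; F has the same number (1) as the first letter P, so it is skipped; S → 2; T → 3; R → 6. Result: P236.

    • Ashcraft: keep A; S → 2; C has the same number as S and is separated from it only by H, so it is skipped; R → 6; F → 1; the final T would be a fourth number, so it is dropped. Result: A261.

    • Rubin: keep R; B → 1; N → 5; only two numbers were assigned, so one zero is appended. Result: R150.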

This is a Xojo SoundEx function:

Public Function SoundEx(word As String) As String
  Const kLength As Integer = 4

  Var value As String

  Var size As Integer = word.Length

  // Make sure the word is at least two characters in length
  If (size > 1) Then
    word = word.Uppercase

    // Convert the word to a character array for faster processing
    Var chars() As String = word.Split

    // For storing the SoundEx character codes
    Var code() As String

    // The current and previous character codes
    Var prevCode As Integer = 0
    Var currCode As Integer = 0

    // Add the first character
    code.AddRow(chars(0))

    Var loopLimit As Integer = size - 1
    // Loop through all the characters and convert them to the proper character code
    For i As Integer = 0 To loopLimit
      Select Case chars(i)
      Case "H", "W"
        currCode = -1
      Case "A", "E", "I", "O", "U", "Y"
        currCode = 0
      Case "B", "F", "P", "V"
        currCode = 1
      Case "C", "G", "J", "K", "Q", "S", "X", "Z"
        currCode = 2
      Case "D", "T"
        currCode = 3
      Case "L"
        currCode = 4
      Case "M", "N"
        currCode = 5
      Case "R"
        currCode = 6
      End Select

      If i > 0 Then
        // Two letters with the same number separated by 'H' or 'W' are coded as a single number
        If currCode = -1 Then currCode = prevCode

        // Check to see if the current code is the same as the last one
        If currCode <> prevCode Then
          // Skip vowels (code 0); only consonant codes are added to the result
          If currCode <> 0 Then
            code.AddRow(currCode.ToString)
          End If
        End If
      End If

      prevCode = currCode

      // If the buffer size meets the length limit, then exit the loop
      If (code.LastRowIndex = kLength - 1) Then
        Exit For
      End If
    Next

    // Pad the code if required
    size = code.LastRowIndex + 1
    For j As Integer = size To kLength - 1
      code.AddRow("0")
    Next

    // Set the return value
    value = String.FromArray(code, "")
  End If

  // Return the computed soundex
  Return value
End Function

You call the SoundEx function like this:

Var result As String
result = SoundEx("Robert") // R163
result = SoundEx("Rupert") // R163
result = SoundEx("Rubin") // R150
result = SoundEx("Ashcraft") // A261
result = SoundEx("Ashcroft") // A261
result = SoundEx("Tymczak") // T522
result = SoundEx("Pfister") // P236
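
Because names that sound alike produce the same code, a search can compare codes instead of spellings. The following is a minimal sketch of that idea, not a definitive implementation. FindSoundAlikes is a hypothetical helper name introduced here for illustration, and it assumes the SoundEx function shown above is available in the same project:

// Hypothetical helper: returns the names that share a SoundEx code with searchTerm
Public Function FindSoundAlikes(names() As String, searchTerm As String) As String()
  Var matches() As String

  // Compute the code for the search term once
  Var target As String = SoundEx(searchTerm)

  // SoundEx (as written above) returns an empty string for input shorter than two characters
  If target = "" Then Return matches

  For Each name As String In names
    If SoundEx(name) = target Then
      matches.AddRow(name)
    End If
  Next

  Return matches
End Function

It might be called like this:

Var names() As String = Array("Robert", "Rubin", "Rupert", "Ashcraft")
Var found() As String = FindSoundAlikes(names, "Robert")
// found contains "Robert" and "Rupert" (both R163)

Since SQLiteDatabase does not include Soundex, one option is to compute the code in Xojo and store it in its own column so it can be used in queries. The sketch below assumes an in-memory database and a hypothetical People table with Name and NameCode columns; adapt the names to your own schema:

Var db As New SQLiteDatabase
Try
  // With no DatabaseFile assigned, Connect creates an in-memory database
  db.Connect
  db.ExecuteSQL("CREATE TABLE People (Name TEXT, NameCode TEXT)")

  // Store each name together with its SoundEx code
  Var people() As String = Array("Robert", "Rubin", "Rupert")
  For Each person As String In people
    db.ExecuteSQL("INSERT INTO People (Name, NameCode) VALUES (?, ?)", person, SoundEx(person))
  Next

  // Look up names using the code of the search term
  Var rs As RowSet = db.SelectSQL("SELECT Name FROM People WHERE NameCode = ?", SoundEx("Rupert"))
  While Not rs.AfterLastRow
    System.DebugLog(rs.Column("Name").StringValue)
    rs.MoveToNextRow
  Wend
Catch error As DatabaseException
  System.DebugLog("Database error: " + error.Message)
End Try

Storing the code at insert time means the query compares plain text values and does not need a Soundex function inside SQLite.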

See also

Full Text Searching topic