You are here: Find Duplicate Constituents > Find Duplicate Constituents > Full and Incremental Duplicate Search Processes > Full and Incremental Duplicate Search Algorithm

Full and Incremental Duplicate Search Algorithm

To identify duplicates, the full and incremental search processes leverage the Soundex phonetic algorithm and Damerau-Levenshtein distance algorithm to calculate the similarity between constituent records.

This is a summary of the matching process:

  1. The program prepares the results table. For a full duplicate search, all data is cleared from previous searches. For an incremental search, the program only clears constituents that no longer exist in the database. If Include inactive is not selected for the search process, the program also clears inactive constituents.

  2. The program creates groups of constituents to compare by running the Soundex algorithm on the constituent key names (Last name for individuals or the Org/Group/Household name for organizations/groups/households). This algorithm finds names that sound the same but have minor differences in spelling. For example, the last names Hendericks, Hendershot, Henderson, and Hunter all have the SOUNDEX H536. They are included in the same group for comparison, while constituents with last names like Anderson, Johnson, Cooke, or Higgins are included in other groups. For a detailed explanation of the Soundex algorithm, see http://en.wikipedia.org/wiki/Soundex. The program divides these groups by constituent type (Individuals, Groups, and Organizations) and then compares each group separately.

    Note: Because the Soundex algorithm matches records based on phonetic similarity, you should use articles at the beginning of organization names consistently—either always or never include them. For example, “The Boys and Girls Club” and “Boys and Girls Club” are not identified as duplicates by the full or incremental duplicate search processes due to the phonetic differences of the first word.

  3. The program then uses four calculations to compare the key name of each constituent to every other constituent in that group. For more information about those calculations, see Matching Score Calculations.

    If the key names on two constituents are similar enough to meet or exceed the match confidence threshold for names, the program then runs calculations for email addresses, phone numbers, and addresses:

  1. Constituents identified as duplicates are added to the table, while invalid matches are filtered from the results. Invalid matches include: constituents with a relationship to one another, constituents in the same Group/Household, and constituents that only matched themselves and no other constituents. If a constituent qualifies as a duplicate for multiple constituents, it is only matched to the constituent with the highest score.

  2. After scoring each match, the program creates matching groups that determine which constituents are merged and in what order. This can get complicated when you have four or more potential matches and not one of them matches every other record in the group. For example, a group includes: Mark P. Gardner (A), Mark Gardner (B), M P. Gardner (C), and M Gardner (D).

    The program designates Mark P. Gardner as the anchor record and compares it to every other record in the match group. Based on the matching scores, M P. Gardner and Mark Gardner match Mark P. Gardner. And M Gardner matches M P Gardner. Because M Gardner does not match the anchor record (Mark P. Gardner), the program excludes M Gardner and will not merge it with the others.

Matching Score Calculations

To calculate match confidence scores, the program uses the Damerau-Levenshtein distance algorithm to make four comparisons.

  1. The first calculation uses the Damerau-Levenshtein distance algorithm to compare strings (full name, address, email, and phone) and identify the number of changes necessary to make the strings match exactly. Changes include adding, removing, or changing characters and transposing a pair of characters. For example, if you compare "Christopher" vs. "Chrsitopher," you would need to transpose the "si" in "Chrsitopher" to fix the misspelling (one change). To learn more about this algorithm, see http://en.wikipedia.org/wiki/Damerau-Levenshtein_distance.

  2. The second calculation breaks each string (email, phone, and address) into individual words and applies the Damerau-Levenshtein distance algorithm to each word pair on the matching constituents. It then averages the scores calculated for all words pairs. If a string on one constituent contains all the words of the string of a matching constituent, the words are considered exact matches. For example, “32 Broad St” and “32 Broad” are matches.

  3. The third calculation breaks each string (full name, address, email, and phone) into words and each word into groups of four characters (q-grams) and then applies the Damerau-Levenshtein distance algorithm to each q-gram on the matching constituents. It then averages the scores calculated for all q-grams. To learn more about q-grams, see http://www.sound-ex.com/alternative_qgram.htm.

  4. The final calculation breaks the string (full name, address, email, and phone) into q-grams without first breaking it into words and then applies the Damerau-Levenshtein distance algorithm to each q-gram on the matching constituents.

The final match confidence score equals the highest score of those four calculations. Some important notes to consider: