Full and Incremental Duplicate Search Algorithm

To identify duplicates, the full and incremental search processes use the soundex phonetic algorithm and the Damerau-Levenshtein distance algorithm to calculate the similarity between constituent records.
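
As a rough illustration of the distance half of that pairing, the sketch below computes the optimal string alignment distance, a common restricted variant of Damerau-Levenshtein that counts insertions, deletions, substitutions, and swaps of adjacent characters. The function names and the normalization into a 0-to-1 similarity are assumptions made for this example; the program's actual calculations may differ.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or exact match)
            )
            # Swap of two adjacent characters, for example "ht" versus "th".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical normalization: 1.0 means identical, 0.0 means nothing in common."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - osa_distance(a.lower(), b.lower()) / longest
```

With this sketch, "Smith" and "Smiht" are one edit apart because the swapped letters count as a single transposition.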

This is a summary of the matching process:

  1. The program prepares the results table. For a full duplicate search, all data from previous searches is cleared. For an incremental search, the program clears only constituents that no longer exist in the database. If Include inactive is not selected for the search process, the program also clears inactive constituents.

  2. The program creates groups of constituents to compare by running the soundex algorithm on the constituent key names (Last name for individuals or the Org/Group/Household name for organizations/groups/households). This algorithm finds names that sound the same but have minor differences in spelling. For example, the last names Hendericks, Hendershot, Henderson, and Hunter all have the soundex value of H536. They are included in the same group for comparison, while constituents with last names like Anderson, Johnson, Cooke, or Higgins are included in other groups. For a detailed explanation of the soundex algorithm, see http://en.wikipedia.org/wiki/soundex. The program divides these groups by constituent type (Individuals, Groups, and Organizations) and then compares each group separately.

    Note: Because the soundex algorithm matches records based on phonetic similarity, you should use articles at the beginning of organization names consistently: either always include them or never include them. For example, “The Boys and Girls Club” and “Boys and Girls Club” are not identified as duplicates by the full or incremental duplicate search processes because the first words are phonetically different.
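
The grouping in step 2 can be sketched as follows. This is a minimal implementation of standard American Soundex plus a hypothetical `group_by_soundex` helper; applying the code to the whole organization name with non-letters dropped is an assumption, not a documented detail of the program.

```python
from collections import defaultdict

# Standard Soundex digit for each coded consonant.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex code, or "" if the name has no letters."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")  # vowels and Y have no code
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]


def group_by_soundex(records):
    """Group (constituent_type, key_name) pairs by type and Soundex code."""
    groups = defaultdict(list)
    for constituent_type, key_name in records:
        groups[(constituent_type, soundex(key_name))].append(key_name)
    return dict(groups)


print(soundex("Hendericks"), soundex("Hunter"))  # H536 H536
print(soundex("The Boys and Girls Club") == soundex("Boys and Girls Club"))  # False
```

Because the first letter is always kept as the first character of the code, a leading article such as "The" places an organization in a different group, which is the behavior described in the note above.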

  3. The program then uses four calculations to compare the key name of each constituent to every other constituent in that group. For more information about those calculations, see Matching Score Calculations.

    If the key names on two constituents are similar enough to meet or exceed the match confidence threshold for names, the program then runs calculations for email addresses, phone numbers, and addresses:

    • If Match constituents based on email address is selected for the search process, the program also compares email addresses. If they match exactly, the program uses this calculation for the matching score: (1 + Name Score) / 2. For example, if the Name Score is .8, the result is (1 + .8) / 2 = .9 (90% match). If they do not match exactly, the email score is zero.

    • If Match constituents based on phone number is selected for the search process, the program also compares phone numbers. If they match exactly, the program uses this calculation for the matching score: (1 + Name Score) / 2. For example, if the Name Score is .95, the result is (1 + .95) / 2 = .975 (rounded to a 98% match). If they do not match exactly, the phone number score is zero.

    • If neither the email nor phone number is an exact match, the program compares addresses. If the addresses are similar enough to meet or exceed the match confidence threshold for addresses, the program uses this calculation for the matching score: (Name Score + Address Score) / 2.
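
The scoring rules in step 3 can be sketched as one function. The function name, parameter names, and the choice to return zero when nothing qualifies are assumptions for illustration. Here `email_exact` and `phone_exact` mean that the corresponding option is selected and the values match exactly. The thresholds are passed in because their values depend on the search process settings.

```python
def matching_score(
    name_score: float,
    name_threshold: float,
    address_threshold: float,
    email_exact: bool = False,
    phone_exact: bool = False,
    address_score: float = 0.0,
) -> float:
    """Combine the name score with email, phone, or address evidence."""
    if name_score < name_threshold:
        return 0.0  # names are not similar enough; nothing else is compared
    if email_exact or phone_exact:
        # An exact email or phone match counts as a perfect second score of 1.
        return (1 + name_score) / 2
    if address_score >= address_threshold:
        return (name_score + address_score) / 2
    return 0.0  # assumption: no supporting evidence means no match


# Thresholds of 0.7 are placeholder values for this example only.
print(matching_score(0.8, 0.7, 0.7, email_exact=True))
print(matching_score(0.95, 0.7, 0.7, phone_exact=True))
print(matching_score(0.8, 0.7, 0.7, address_score=0.9))
```

The first two calls reproduce the worked examples in the bullets above (.9 and .975); the third averages the name and address scores.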

  4. Constituents identified as duplicates are added to the table, while invalid matches are filtered from the results. Invalid matches include constituents with a relationship to one another, constituents in the same Group/Household, and constituents that matched only themselves and no other constituents. If a constituent qualifies as a duplicate of multiple constituents, it is matched only to the constituent with the highest score.
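
One possible reading of the filtering in step 4 is sketched below. The data shapes are hypothetical: candidate matches are (source, duplicate, score) tuples, and relationships and shared households are sets of unordered ID pairs. How the program stores this information is not described here.

```python
def filter_matches(candidates, related, same_household):
    """Drop invalid matches and keep the best-scoring source per duplicate.

    candidates: iterable of (source_id, duplicate_id, score)
    related, same_household: sets of frozenset({id_a, id_b}) pairs
    Returns {duplicate_id: (source_id, score)}.
    """
    best = {}
    for source, duplicate, score in candidates:
        if source == duplicate:
            continue  # a record matched only itself
        pair = frozenset((source, duplicate))
        if pair in related or pair in same_household:
            continue  # related constituents and household members are not duplicates
        current = best.get(duplicate)
        if current is None or score > current[1]:
            best[duplicate] = (source, score)  # keep only the highest score
    return best
```

For example, if constituent 2 scores .9 against constituent 1 and .95 against constituent 3, this sketch keeps only the match to constituent 3.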

  5. After scoring each match, the program creates matching groups that determine which constituents are merged and in what order. This can get complicated when you have four or more potential matches and not one of them matches every other record in the group. For example, a group includes: Mark P. Gardner (A), Mark Gardner (B), M P. Gardner (C), and M Gardner (D).

    The program designates Mark P. Gardner as the anchor record and compares it to every other record in the match group. Based on the matching scores, M P. Gardner and Mark Gardner match Mark P. Gardner, and M Gardner matches M P. Gardner. Because M Gardner does not match the anchor record (Mark P. Gardner), the program excludes M Gardner and will not merge it with the others.
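
The anchor rule in this example can be sketched as follows. The helper name and the representation of matches as unordered pairs are assumptions, and the sketch takes the anchor as an input because how the program chooses it is not described here.

```python
def build_match_group(anchor, candidates, matches):
    """Keep only the candidates that match the anchor record directly.

    matches: set of frozenset({record_a, record_b}) pairs that met the threshold.
    Returns (kept, excluded), with the anchor first in kept.
    """
    kept, excluded = [anchor], []
    for record in candidates:
        if frozenset((anchor, record)) in matches:
            kept.append(record)
        else:
            excluded.append(record)  # matches another member, but not the anchor
    return kept, excluded


matches = {
    frozenset(("Mark P. Gardner", "Mark Gardner")),
    frozenset(("Mark P. Gardner", "M P. Gardner")),
    frozenset(("M P. Gardner", "M Gardner")),
}
kept, excluded = build_match_group(
    "Mark P. Gardner",
    ["Mark Gardner", "M P. Gardner", "M Gardner"],
    matches,
)
print(kept)      # ['Mark P. Gardner', 'Mark Gardner', 'M P. Gardner']
print(excluded)  # ['M Gardner']
```

M Gardner is excluded even though it matches M P. Gardner, because membership in the group depends only on a direct match with the anchor.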