Advanced Matching Algorithms
The match and merge process uses advanced algorithms to enable more accurate grouping and scoring. Both algorithms can optionally be configured when creating a matcher.
There are two methods for computing the confidence score of a group:
Direct Scoring: This method only takes into account the direct matches found by the rules. The group score is the average of these match scores. Pairs in the group that are not directly matched by any rule are considered as having a score of zero.
Transitive Scoring: This method takes into account the direct matches found by the rules, plus computed indirect (transitive) matches. The group score is also the average of the match scores in the group. The major difference is that pairs in the group that are not matched by any rule are given a transitive score, which is the best combination of scores linking one record to the other.
Understand Transitive Scoring
Transitive scoring treats each match score as a probability of similarity between two records, and combines the probabilities along a chain of matches by multiplying them.
For example, if A matches B with a probability of 0.5 and B matches C with a probability of 0.8, this method assumes that A matches C with a (combined) probability of 0.5 * 0.8 = 0.4.
The following example illustrates direct vs. transitive scoring. In this example, the match rules have a match score of 90.
If using direct scoring:
Jane Smith (email@example.com) matches Jane Smith (firstname.lastname@example.org) according to the Same Name rule with a match score of 90
Jane Smith (email@example.com) matches Janet Jones (firstname.lastname@example.org) according to the Same Email rule with a match score of 90
Jane Smith (email@example.com) did not match Janet Jones (firstname.lastname@example.org) since they have a different name and email, so their direct match score is 0.
When computed with the direct scoring method, the confidence score of this group is:
(90 + 90 + 0) / 3 = 60
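As a minimal sketch of the direct scoring computation above, the following Python function averages the rule scores over every pair of records in a group and counts unmatched pairs as zero. The function name, the record labels (r1, r2, r3), and the dictionary layout are illustrative assumptions, not part of the product.

```python
from itertools import combinations

def direct_group_score(records, direct_scores):
    """Average the direct match scores over every pair of records.

    direct_scores maps a frozenset of two record ids to a rule score
    (0-100). Pairs missing from the mapping count as 0.
    """
    pairs = list(combinations(records, 2))
    total = sum(direct_scores.get(frozenset(pair), 0) for pair in pairs)
    return total / len(pairs)

# Hypothetical labels for the three records in the example above.
records = ["r1", "r2", "r3"]
direct_scores = {
    frozenset({"r1", "r2"}): 90,  # Same Name rule
    frozenset({"r2", "r3"}): 90,  # Same Email rule
}
group_score = direct_group_score(records, direct_scores)
print(group_score)  # 60.0
```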
If using transitive scoring:
Jane Smith (email@example.com) did not match Janet Jones (firstname.lastname@example.org) since they have a different name and email, but since they are linked via the other record, their transitive match probability is 0.90 * 0.90 = 0.81, which corresponds to a score of 81.
When computed with the transitive scoring method, the confidence score of this group is:
(90 + 90 + 81) / 3 = 87
Bear in mind that transitive scoring navigates all possible paths between the records, computes all possible combinations of scores, and retains the best transitive score.
A best transitive score may override a direct score. Two records may be poorly matched directly, but strongly related via other records.
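The best-path search described above can be sketched in Python as follows. This is not the product's implementation: it assumes that a path's probability is the product of its match probabilities and that the best path is the one with the highest product, and it finds that path with a Floyd-Warshall style closure. The function names and the record labels are hypothetical.

```python
from itertools import combinations

def best_transitive_scores(records, direct_scores):
    """Return the best path probability (0.0-1.0) for every record pair.

    direct_scores maps a frozenset of two record ids to a rule score
    (0-100). A path's probability is the product of its edges, and the
    best path between two records is the one with the highest product.
    """
    best = {a: {b: 0.0 for b in records} for a in records}
    for pair, score in direct_scores.items():
        a, b = tuple(pair)
        best[a][b] = best[b][a] = score / 100
    # Try every record k as an intermediate hop between i and j.
    for k in records:
        for i in records:
            for j in records:
                if i != j:
                    best[i][j] = max(best[i][j], best[i][k] * best[k][j])
    return best

def transitive_group_score(records, direct_scores):
    """Average the best transitive scores over every pair, on a 0-100 scale."""
    best = best_transitive_scores(records, direct_scores)
    pairs = list(combinations(records, 2))
    return 100 * sum(best[a][b] for a, b in pairs) / len(pairs)

# Hypothetical labels for the three records in the example above.
records = ["r1", "r2", "r3"]
direct_scores = {
    frozenset({"r1", "r2"}): 90,  # Same Name rule
    frozenset({"r2", "r3"}): 90,  # Same Email rule
}
best = best_transitive_scores(records, direct_scores)
score = transitive_group_score(records, direct_scores)
print(round(score))  # 87
```

Because the closure keeps the maximum over all paths, a weak direct match between two records is replaced whenever a chain through other records yields a higher product, which is the override behavior described above.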
When data stewards review matches using the graph available in the Explain Record view or in the duplicate managers, the best transitive matches appear as grey edges.
For performance reasons, the transitive scoring mode is automatically disabled for clusters larger than 300 records. These large clusters fall back on direct scoring.
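The fallback rule above amounts to a simple size check. In the sketch below, the mode names ("direct", "transitive"), the constant name, and the function name are assumptions chosen for illustration; only the 300-record limit comes from the text.

```python
MAX_TRANSITIVE_CLUSTER_SIZE = 300  # limit stated in the text above

def effective_scoring_mode(configured_mode, cluster_size):
    """Return the scoring mode actually applied to a cluster.

    Transitive scoring is replaced by direct scoring for clusters
    larger than the limit. Any other configuration is left unchanged.
    """
    if configured_mode == "transitive" and cluster_size > MAX_TRANSITIVE_CLUSTER_SIZE:
        return "direct"
    return configured_mode

print(effective_scoring_mode("transitive", 301))  # direct
```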
The simplest method for merging groups into golden records consists of taking the initial largest match group that contains all records related by any match rule, computing the overall score for that group (either by direct or transitive scoring), and deciding whether or not to merge that group.
However, in certain situations, you may want to consider possible sub-groups in the largest group.
Multi-iteration grouping provides the capability to process sub-groups within the coarse-grained group. It merges records by iterations, taking into account rules sharing the same score at each iteration, before moving to the next lower score. At each iteration, the merge policy thresholds apply to decide whether to create golden records or suggestions.
The following example illustrates a multi-iteration grouping.
In this example:
Janet Smith (email@example.com) matches J Smith (firstname.lastname@example.org) according to the Same Email rule with a match score of 95
Jean Smith (email@example.com) matches J Jones (firstname.lastname@example.org) according to the Same Email rule with a match score of 95
Janet Smith (email@example.com) matches Jean Smith (firstname.lastname@example.org) according to a Fuzzy Name rule with a match score of 15
The overall group score with direct scoring is (95 + 95 + 15) / 3 = 68.3, but with transitive scoring, the group score is (95 + 95 + 81) / 3 = 90.3.
Intuitively, you would like to merge the first pair, then merge the second pair, since they each match very well. Then you would let the data steward decide whether the two resulting golden records should merge.
If you only rely on the overall group score, you cannot achieve that.
Multi-iteration grouping first takes into account only the rules sharing the highest score (95) and merges the groups.
Then it takes the rules sharing the next lower score (15), and so on.
In the example, if you configure a merge policy to create golden records for scores greater than or equal to 95, and to raise suggestions to the data stewards for groups with lower scores, you achieve the following:
Janet Smith (email@example.com) and J Smith (firstname.lastname@example.org) merge into a first golden record.
Jean Smith (email@example.com) and J Jones (firstname.lastname@example.org) merge into a second golden record.
The two golden records are then grouped in a suggestion.
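The iteration order described above can be sketched in Python with a union-find structure. This is a simplified illustration, not the product's algorithm: it uses the rule score of each iteration as the value compared with the merge policy threshold, whereas the product computes a group score (direct or transitive) at each iteration. The function name, the parameter names, and the return layout are hypothetical.

```python
def multi_iteration_grouping(records, matches, golden_threshold):
    """Group records by iterating over rule scores in descending order.

    matches is a list of (record_a, record_b, score) tuples.
    Returns (golden_groups, suggestions), where each suggestion is a
    list of the golden groups proposed for a merge.
    """
    parent = {r: r for r in records}

    def find(r):
        # Follow parent links to the group root, shortening the path.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    suggestions = set()
    # One iteration per distinct score, highest first.
    for score in sorted({s for _, _, s in matches}, reverse=True):
        for a, b, s in matches:
            if s != score:
                continue
            root_a, root_b = find(a), find(b)
            if root_a == root_b:
                continue
            if score >= golden_threshold:
                parent[root_b] = root_a  # merge into one golden record
            else:
                suggestions.add(frozenset({root_a, root_b}))

    groups = {}
    for r in records:
        groups.setdefault(find(r), set()).add(r)
    golden_groups = sorted(sorted(g) for g in groups.values())
    suggested = [sorted(sorted(groups[root]) for root in s) for s in suggestions]
    return golden_groups, suggested

records = ["Janet Smith", "J Smith", "Jean Smith", "J Jones"]
matches = [
    ("Janet Smith", "J Smith", 95),     # Same Email rule
    ("Jean Smith", "J Jones", 95),      # Same Email rule
    ("Janet Smith", "Jean Smith", 15),  # Fuzzy Name rule
]
golden, suggested = multi_iteration_grouping(records, matches, golden_threshold=95)
print(golden)     # two golden groups
print(suggested)  # one suggestion linking them
```

With a threshold of 95, the two Same Email pairs become separate golden groups in the first iteration, and the Fuzzy Name match at 15 only produces a suggestion linking them.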