Match and merge

The match-and-merge process detects the duplicates in order to consolidate them into a single golden record.

Capabilities per entity type

Matching works differently for fuzzy-matched and ID-matched entities.

  • Fuzzy-matched entities use a matcher to automatically detect duplicates records and group them into clusters. It consolidates their values into a golden records using survivorship rules.

  • ID-matched entities perform an exact match on the user-provided ID value, as this ID is a primary key that is unique across all systems. This type of entity does not need a matcher but uses survivorship rules to merge the values into a golden record.

  • Basic entities do not support match and merge.

Basic and ID-matched entities do not support fuzzy matching in the certification process.
However, you still can define matchers for these entities. Such matchers are used only to detect duplicates at record creation time. When a user creates a new record, and the DETECT_DUPS validation is active, the matcher is used to search for similar records in order to warn the user when a new entry matches existing records.

ID- vs. fuzzy-matching

When using ID matching, the outcome of the matching process is extremely predictable. Indeed, all records with the same ID will match and merge.

When using fuzzy matching, the outcome of the matching process is more complex and flexible, as it is based on fuzzy matching rules and algorithms that depend on the data.

Creating match groups

The first phase of match and merge is to detect duplicate records and create match groups (duplicate clusters) that will be later consolidated (merged) into golden records.

For ID-matching entities, this phase does not apply since the records are grouped by their ID.

Matching record pairs

First, the matching must detect the duplicates—that is, pairs of records that seem similar.

Match rules in a matcher define the conditions for considering two records a match. These rules have produce a matching score, which represents the percentage of confidence you put in a match that occurs thanks to a rule.

Match rules and scores

For example, the following rule (MATCH_RULE_1) defines two businesses that are exactly the same, and may have a score of 100:

Record1.CustomerName = Record2.CustomerName and
Record1.InputAddress.Address = Record2.InputAddress.Address and
...
Record1.InputAddress.City = Record2.InputAddress.City

The following rule (MATCH_RULE_2) would have for example a score of 85, as it detects businesses that have large number of similarities:

SEM_EDIT_DISTANCE_SIMILARITY( Record1.CustomerName, Record2.CustomerName ) > 85 and
SEM_EDIT_DISTANCE_SIMILARITY( Record1.InputAddress.Address, Record2.InputAddress.Address ) > 65 and
SEM_EDIT_DISTANCE_SIMILARITY( Record1.InputAddress.City, Record2.InputAddress.City ) > 65

The following rule (MATCH_RULE_3) would have a score of 20, as it detects two businesses with vaguely similar names in the same city:

SEM_EDIT_DISTANCE_SIMILARITY( Record1.CustomerName, Record2.CustomerName ) > 50
Record1.InputAddress.City = Record2.InputAddress.City

When two records match, they receive a match score equal to the highest score of all the rules that matched them (highest confidence).

For example, two records matching by the MATCH_RULE_2 (score 85) and MATCH_RULE_3 (score 20) would have a match score of 85 (the highest of 85 and 20)

Creating match groups

Match groups re-group duplicate records that have matched.

The default and initial "coarse-grained" grouping mechanism is transitive. If groups all records related by a match rule.
Think about this grouping mechanism as follows: if A matches B and B matches C, then A, B, and C are in the same match group.

Grouping transitivity example

The following example shows how transitivity applies to the grouping:

Match Group

In this example:

  • Jane Smith (j.smith@acme.com) matches Jane Smith (jane@goliath.com) according to the Same Name rule,

  • Jane Smith (jane@goliath.com) matches Janet Jones (jane@goliath.com) according to the Same Email rule,

  • Jane Smith (j.smith@acme.com) does not match Janet Jones (jane@goliath.com) since they have a different name and email.

Jane Smith (j.smith@acme.com), Jane Smith (jane@goliath.com) and Janet Jones (jane@goliath.com) are in the same initial match group due to the grouping transitivity.

The multi-iterating grouping algorithm provides a more sophisticated method for creating clusters from the original coarse-grained grouping mechanism.

Confidence score

A match group has a confidence score that expresses how confident you can be with that group of matching records. This score is computed from scores of the match pairs that compose the group.

There are two methods for computing the confidence score of a group, direct and transitive scoring, which affect the resulting confidence score of the clusters

Merging groups into golden records

Depending on the confidence score, you may let the matcher automatically merge the match group. This operation creates a golden record from the group, then applies the consolidation rule to the group of records to define the values that are consolidated in the golden record.

If a match group is not merged automatically, because its confidence score is not high enough, it is flagged as a merge suggestion. In that case:

  • Incoming records are kept as singleton golden records.

  • Existing groups of records and golden records remain untouched.

Merge suggestions are reviewed by data stewards with duplicate management actions, to decide whether or not to merge groups and create golden records.

The merge policy set when creating a matcher defines the confidence scores required to automatically merge groups in a variety of cases. See Automate merge and confirmation for more information about the merge policy.

Confirming golden record

As the values change in the source records, the match groups and golden records may change.

Value changes impact on unconfirmed records

Renaming a business from "Micro Soft Incorporated" to "Micro Soft" is likely to have it match and merge with an existing "Microsoft" record, if we fuzzy-match by business name. The original "Micro Soft Incorporated" golden record would then cease to exist, as it would be merged within the "Microsoft" record.

Confirming a golden record consists in "freezing" the match group to avoid having it reconsidered every time data changes.

This is typically done by a data steward with a duplicate management action. The steward manually confirms the correct match groups and fixes the incorrect match groups.

Depending on the confidence score computed for a match group, you may also want to automatically confirm the golden record to avoid having the steward reviewing all the data.

The auto-confirm policy set when creating a matcher defines the confidence score required to automatically confirm golden records.

It also allows you to set whether singletons (golden records composed of a single master record) should be automatically confirmed.

Over time, records may have the following confirmation status:

  • Not confirmed: the golden record was never confirmed by the matcher or a user.

  • Confirmed: the golden record was entirely confirmed by the matcher or a user.

  • Partially confirmed: part of the match group that composes the golden record was confirmed by a user, but some masters in this match group are still not marked as confirmed.

  • Previously confirmed: a record that was confirmed but which group has been modified by a user.

Regardless of the confirmation status of a record, a data steward is always able, with a duplicate management action, to manually split and merge duplicates.

Consolidating and overriding values

Once multiple master records are merged in a golden record, the data from these records is consolidated into the golden record, possibly applying overrides—that is, values entered by data stewards on the data hub.

Survivorship indicates which data survives in the golden records when multiple records merge.

A survivorship rule defines, for attributes in both fuzzy-matched and ID-matched entities, how golden record values are computed. It is composed of:

  • A consolidation rule, defining how to consolidate values from duplicate records (detected by the matcher) into a single (golden) record.

  • An override rule, defining how values possibly authored by users override the consolidated value in the golden record.

A survivorship rule applies to a single attribute or to a set of attributes of a given entity. See Survivorship for more information about survivorship rules.

Golden records changes

After a golden record is created, it may be automatically joined by new master records, merged with another golden record, or become part of a suggestion. These changes in the groups may cause the consolidation/survivorship process to run and alter the values of the golden record.

The golden record life cycle is automatically handled according to the matcher and survivorship rules configured for the entity.

The automated matching process aggregates groups of records, but does not automatically split existing groups.

Data stewards have the possibility to use duplicate managers to manually review, confirm, merge, split match groups and suggestions. They can also override values for golden record values, according to the override rules.