The data certification process

Introduction to the certification process

The Certification Process creates consolidated and certified Golden Records from various sources:

  • Source Records, pushed into the hub by middleware systems on behalf of upstream applications (known as the Publishers).
    Depending on the entity type, these records are either converted directly into golden records (basic entities) or matched and consolidated into golden records (ID and fuzzy matched entities). When matched and consolidated, these records are referred to as Master Records, and the golden records they contribute to are referred to as Master Based golden records.

  • Source Authoring Records, authored by users in the MDM applications.
    When a user authors data in an MDM application, they perform one of the following operations, depending on the entity type and the application design:

    • They create new golden records or update existing golden records that exist only in the hub and do not exist in any of the publishers. These records are referred to as Data Entry Based golden records. This pattern is allowed for all entities, but basic entities support only this pattern.

    • They create or update master records on behalf of publishers, submitting these records to matching and consolidation. This pattern is allowed only for ID and fuzzy matched entities.

    • They override golden values resulting from the consolidation of records pushed by publishers. This pattern is allowed only for ID and fuzzy matched entities.

  • Delete Operations, performed by users on golden and master records of entities with delete enabled.

  • Matching Decisions, taken by data stewards for fuzzy matched entities using duplicate managers. Such decisions include confirming, merging, or splitting groups of matching records, as well as accepting or rejecting suggestions.

The certification process takes these various sources and applies the rules and constraints defined in the model to create, update, or delete the golden data that business users browse in the MDM applications and that downstream applications consume from the hub.

This process is automated. It involves several phases, generated automatically from the rules and constraints defined in the model, based on the functional knowledge of the entities and the publishers involved.

The following sections detail the certification process for ID matched, fuzzy matched, and basic entities, as well as the deletion process for all entities.

Rules involved in the process

The rules involved in the process include:

  • Enrichers: Sequence of transformations performed on the source and/or consolidated data to make it complete and standardized.

  • Data Quality Constraints: Checks carried out on the source and/or consolidated data to isolate or flag erroneous rows. These include Referential Integrity, Unique Keys, Mandatory Attributes, List of Values, SemQL, and Plug-in Validations.

  • Matcher: This rule applies to fuzzy matched entities only. It is a set of match rules that bin (group) and then match similar records to detect duplicates. The resulting duplicate clusters are merged (consolidated) and confirmed depending on their confidence score (a sketch of the matching logic follows this list).

  • Survivorship Rule: This rule defines, for Fuzzy Matched and ID Matched Entities, how the golden record values are computed. It is composed of:

    • A Consolidation Rule, defining how to consolidate values from duplicate records (detected by the matcher) into a single (golden) record.

    • An Override Rule, defining how values authored by users override the consolidated values in the golden record. Consolidation and override are illustrated in a sketch following this list.
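
The following Java sketch illustrates the matcher behavior under assumptions: records are binned on a simple key, each pair within a bin is compared by a few match rules, and the resulting score decides whether the pair is merged, merged and confirmed, or kept as a suggestion. It is a minimal sketch only. The CustomerRecord fields, the binning expression, the two match rules, and the threshold values are hypothetical (they are not SemQL match rules or product configuration), and the pair score stands in here for the group confidence score used by the merge and auto-confirm policies.

    // Minimal, hypothetical matcher sketch: bin records, match pairs within each
    // bin, then decide merge/confirm/suggest from the score. Not the product API.
    import java.util.*;

    public class MatcherSketch {

        // Hypothetical source record limited to the fields used by the rules.
        record CustomerRecord(String id, String name, String city, String phone) {}

        // Binning phase: group records on a coarse key so that only records
        // sharing the key are compared pairwise.
        static Map<String, List<CustomerRecord>> bin(List<CustomerRecord> records) {
            Map<String, List<CustomerRecord>> bins = new HashMap<>();
            for (CustomerRecord r : records) {
                String key = r.city().toUpperCase(Locale.ROOT);   // assumed binning expression
                bins.computeIfAbsent(key, k -> new ArrayList<>()).add(r);
            }
            return bins;
        }

        // Matching phase: compare a pair of records and return a match score.
        // Rules are checked from the strongest down, so a pair matching several
        // rules keeps the highest score.
        static int matchScore(CustomerRecord a, CustomerRecord b) {
            if (a.phone() != null && a.phone().equals(b.phone())) return 90;  // assumed "same phone" rule
            if (a.name().equalsIgnoreCase(b.name())) return 70;               // assumed "same name" rule
            return 0;
        }

        public static void main(String[] args) {
            List<CustomerRecord> records = List.of(
                new CustomerRecord("1", "Acme Corp", "Boston", "555-0100"),
                new CustomerRecord("2", "ACME CORP", "Boston", "555-0100"),
                new CustomerRecord("3", "Acme Corporation", "Boston", null),
                new CustomerRecord("4", "Globex", "Denver", "555-0199"));

            // Assumed thresholds standing in for the merge and auto-confirm policies.
            int mergeThreshold = 60, autoConfirmThreshold = 85;

            for (List<CustomerRecord> binContent : bin(records).values()) {
                for (int i = 0; i < binContent.size(); i++) {
                    for (int j = i + 1; j < binContent.size(); j++) {
                        int score = matchScore(binContent.get(i), binContent.get(j));
                        if (score == 0) continue;
                        String action = score >= autoConfirmThreshold ? "merge and confirm"
                                      : score >= mergeThreshold ? "merge"
                                      : "suggest for steward review";
                        System.out.printf("%s ~ %s score=%d -> %s%n",
                            binContent.get(i).id(), binContent.get(j).id(), score, action);
                    }
                }
            }
        }
    }

With these sample values, records 1 and 2 fall in the same bin, score 90 on the assumed "same phone" rule, and would be merged and confirmed; a lower score would leave the group as a suggestion for a data steward.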
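
The survivorship rule can be sketched in the same spirit. The example below assumes a consolidation rule that takes each attribute from the most trusted publisher providing a non-null value, followed by an override rule where user-authored values always win. The publisher ranking, attribute names, and both strategies are assumptions for illustration, not the product's survivorship configuration.

    // Minimal, hypothetical survivorship sketch: consolidate duplicate master
    // records into golden values, then apply user overrides on top.
    import java.util.*;

    public class SurvivorshipSketch {

        record MasterRecord(String publisher, Map<String, String> values) {}

        // Assumed consolidation rule: for each attribute, keep the value from the
        // highest-ranked publisher that provides one.
        static Map<String, String> consolidate(List<MasterRecord> duplicates,
                                               List<String> publisherRanking,
                                               List<String> attributes) {
            Map<String, String> golden = new HashMap<>();
            for (String attr : attributes) {
                for (String publisher : publisherRanking) {
                    Optional<String> value = duplicates.stream()
                        .filter(m -> m.publisher().equals(publisher))
                        .map(m -> m.values().get(attr))
                        .filter(Objects::nonNull)
                        .findFirst();
                    if (value.isPresent()) { golden.put(attr, value.get()); break; }
                }
            }
            return golden;
        }

        // Assumed override rule: any value authored by a user replaces the
        // consolidated value for that attribute.
        static Map<String, String> applyOverrides(Map<String, String> golden,
                                                  Map<String, String> userOverrides) {
            Map<String, String> result = new HashMap<>(golden);
            result.putAll(userOverrides);
            return result;
        }

        public static void main(String[] args) {
            List<MasterRecord> duplicates = List.of(
                new MasterRecord("CRM", Map.of("name", "Acme Corp", "phone", "555-0100")),
                new MasterRecord("ERP", Map.of("name", "ACME CORPORATION")));

            Map<String, String> golden = consolidate(
                duplicates, List.of("ERP", "CRM"), List.of("name", "phone"));
            System.out.println("Consolidated: " + golden);   // name from ERP, phone from CRM

            Map<String, String> published = applyOverrides(golden, Map.of("name", "Acme Corporation"));
            System.out.println("Published:    " + published); // the user override wins for name
        }
    }
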

Certification process for ID- and fuzzy-matched entities

The following figure describes the certification process and the various table structures involved in this process.

Figure 1. Certification process for matched entities

The certification process involves the following steps:

  1. Enrich and Standardize Source Data: Source Authoring Records created or updated on behalf of publishers and Source Records are enriched and standardized using the SemQL and API (Java Plug-in or REST client) Enrichers, executed Pre-Consolidation.

  2. Validate Source Data: The enriched and standardized records are checked against the various Constraints executed Pre-Consolidation. Erroneous records are ignored for the rest of the processing, and the errors are logged.

    Note that source authoring records are enriched and validated only for basic entities. For ID and fuzzy matched entities, source authoring records are not enriched or validated.

  3. Match and Find Duplicates: For fuzzy matched entities, this step matches pairs of records using a Matcher and creates groups of matching records (match groups). For ID matched entities, records are matched simply on their ID value.
    The matcher works as follows:

    • It runs a set of Match Rules. Each rule has two phases: first, a binning phase creates small bins of records; then a matching phase compares each pair of records within these bins to detect duplicates.

    • Each match rule has a Match Score that expresses how strongly a pair of records matches. A pair of records that matches according to one or more rules is given the highest Match Score of these rules.

    • When a match group is created, an overall Confidence Score is computed for that group. According to this score, the group is marked as a suggestion or immediately merged, and possibly confirmed. These automated actions are configured in the Merge Policy and Auto-Confirm Policy of the matcher.

    • Matching Decisions taken by users on match groups are applied at that point, superseding the matcher’s choices.

  4. Consolidate Data: This step consolidates the duplicates of each match group into a single consolidated record. The Consolidation Rules defined in the Survivorship Rules determine how the attribute values are consolidated.

  5. Enrich Consolidated Data: The SemQL and API (Java Plug-in or REST client) Enrichers executed Post-Consolidation run to standardize or add data to the consolidated records.

  6. Publish Certified Golden Data: This step finally publishes the Golden Records for consumption.

    • This step applies possible overrides from Source Authoring Records, according to the Override Rules defined in the Survivorship Rules.

    • This step also creates or updates Data Entry Based golden records (which exist only in the hub) from Source Authoring Records.

  7. Validate Golden Data: The quality of the golden records is checked against the various Constraints executed on golden records (Post-Consolidation). Note that, unlike the pre-consolidation validation, this validation does not remove erroneous golden records from the flow, but flags them as erroneous. The errors are also logged. The contrast between the two validations is sketched after these steps.

  8. Historize Data: Golden and master data changes are added to their history if historization is enabled.
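
The difference between the two validation steps above can be sketched as follows: pre-consolidation validation (step 2) excludes erroneous source records from further processing, while post-consolidation validation (step 7) keeps erroneous golden records in the flow and only flags them. The Row structure and the mandatory-email check below are assumptions standing in for any data quality constraint.

    // Minimal, hypothetical sketch contrasting pre- and post-consolidation
    // validation behavior on erroneous records.
    import java.util.*;

    public class ValidationSketch {

        record Row(String id, String email) {}

        // Assumed mandatory-attribute check standing in for any constraint.
        static boolean isValid(Row row) {
            return row.email() != null && row.email().contains("@");
        }

        public static void main(String[] args) {
            List<Row> source = List.of(new Row("1", "a@example.com"), new Row("2", null));

            // Pre-consolidation: erroneous rows are logged and removed from the flow.
            List<Row> kept = new ArrayList<>();
            for (Row row : source) {
                if (isValid(row)) kept.add(row);
                else System.out.println("Pre-consolidation error logged, row excluded: " + row.id());
            }

            // ... matching, consolidation, and publishing work on 'kept' only ...

            // Post-consolidation: erroneous golden records stay in the flow but are flagged.
            Map<String, Boolean> errorFlags = new HashMap<>();
            for (Row golden : kept) {
                boolean ok = isValid(golden);
                errorFlags.put(golden.id(), !ok);
                if (!ok) System.out.println("Post-consolidation error logged, record flagged: " + golden.id());
            }
            System.out.println("Published golden records: " + kept.size() + ", error flags: " + errorFlags);
        }
    }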

Source Authoring Records are not enriched or validated for ID- and fuzzy-matched entities as part of the certification process. These records should be enriched and validated as part of the steppers in which users author the data.

Certification process for basic entities

The following figure describes the certification process and the various table structures involved in this process.

Figure 2. Certification process for basic entities

The certification process involves the following steps (a sketch of the enrichment phase follows the list):

  1. Enrich and Standardize Source Data: During this step, the Source Records and Source Authoring Records are enriched and standardized using SemQL and API (Java Plug-in and REST client) Enrichers executed Pre-Consolidation.

  2. Validate Source Data: The quality of the enriched source data is checked against the various Constraints executed Pre-Consolidation. Erroneous records are ignored for the rest of the processing, and the errors are logged.

  3. Publish Certified Golden Data: This step finally publishes the Golden Records for consumption.

  4. Historize Data: Golden data changes are added to their history if historization is enabled.
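
As an illustration of the enrich-and-standardize phase, the sketch below chains a few enrichers over a source row before it would be validated and published. It is a minimal sketch under assumptions: the ContactRow fields and the three transformations are hypothetical and stand in for the SemQL or API enrichers configured in the model.

    // Minimal, hypothetical sketch of a pre-consolidation enricher chain:
    // each enricher completes or standardizes the row before validation.
    import java.util.*;
    import java.util.function.UnaryOperator;

    public class EnricherSketch {

        // Simple mutable holder for the sketch; field names are assumptions.
        static class ContactRow {
            String name, country, phone;
            ContactRow(String name, String country, String phone) {
                this.name = name; this.country = country; this.phone = phone;
            }
            public String toString() { return name + " / " + country + " / " + phone; }
        }

        public static void main(String[] args) {
            // Assumed enrichers: trim the name, default the country,
            // strip non-digit characters from the phone number.
            List<UnaryOperator<ContactRow>> enrichers = List.of(
                row -> { row.name = row.name.trim(); return row; },
                row -> { if (row.country == null || row.country.isBlank()) row.country = "US"; return row; },
                row -> { if (row.phone != null) row.phone = row.phone.replaceAll("[^0-9]", ""); return row; });

            ContactRow row = new ContactRow("  jane doe ", "", "(555) 010-0000");
            for (UnaryOperator<ContactRow> enricher : enrichers) {
                row = enricher.apply(row);   // enrichers run in sequence, pre-consolidation
            }
            System.out.println(row);         // jane doe / US / 5550100000
        }
    }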

Note that:

  • Basic entities do not separate Source Records from Source Authoring Records. Both follow the same process.

  • Source data for basic entities does not pass through enrichers or validations executed post-consolidation.

Deletion process

A Delete Operation (for basic, ID, or fuzzy matched entities) on a golden record involves the following steps (a sketch of the propagation logic follows the list):

  1. Propagate through Cascade: Extends the deletion to the child records directly or indirectly related to the deleted ones with a Cascade configuration for Delete Propagation.

  2. Propagate through Nullify: Nullifies child records related to the deleted ones with a Nullify configuration for the Delete Propagation.

  3. Compute Restrictions: Removes from the deletion the records having related child records with a Restrict configuration for the Delete Propagation. If such restrictions are found, the delete operation is canceled as a whole.

  4. Propagate Delete to owned Master Records: Propagates the deletion to the master records attached to the deleted golden records. This step applies only to ID- and fuzzy-matched entities.

  5. Publish Deletion: Tracks the record deletion, with the record values for soft deletes only, and then removes the records from the golden and master data. For a hard delete, this step removes any trace of the records from every table; the only remaining trace of a hard delete is the ID (without data) of the deleted master and golden records. Deletions are tracked in the history for golden and master records if historization is configured.
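
The following sketch illustrates, under assumptions, how the Cascade, Nullify, and Restrict propagation modes could play out for a deleted golden record. The entity names, record IDs, and per-relationship configuration are hypothetical; in the product, the propagation mode is defined in the model, not coded this way.

    // Minimal, hypothetical sketch of delete propagation: child records related
    // to a deleted golden record are cascaded, nullified, or cause the whole
    // delete to be canceled, depending on the configured propagation mode.
    import java.util.*;

    public class DeletePropagationSketch {

        enum Propagation { CASCADE, NULLIFY, RESTRICT }

        record ChildRecord(String entity, String id, String parentId) {}

        public static void main(String[] args) {
            Set<String> goldenToDelete = Set.of("CUST-1");

            // Child records, with the propagation mode assumed for their relationship.
            Map<ChildRecord, Propagation> children = Map.of(
                new ChildRecord("Order",   "ORD-1",  "CUST-1"), Propagation.CASCADE,
                new ChildRecord("Note",    "NOTE-1", "CUST-1"), Propagation.NULLIFY,
                new ChildRecord("Invoice", "INV-1",  "CUST-2"), Propagation.RESTRICT);

            Set<String> cascaded = new HashSet<>();
            Set<String> nullified = new HashSet<>();
            boolean restricted = false;

            for (Map.Entry<ChildRecord, Propagation> entry : children.entrySet()) {
                ChildRecord child = entry.getKey();
                if (!goldenToDelete.contains(child.parentId())) continue;  // unrelated child
                switch (entry.getValue()) {
                    case CASCADE  -> cascaded.add(child.id());   // 1. extend the delete
                    case NULLIFY  -> nullified.add(child.id());  // 2. clear the reference
                    case RESTRICT -> restricted = true;          // 3. block the whole delete
                }
            }

            if (restricted) {
                System.out.println("Delete canceled as a whole: a Restrict child references a deleted record.");
            } else {
                System.out.println("Delete golden: " + goldenToDelete
                    + ", cascaded children: " + cascaded
                    + ", nullified references: " + nullified);
            }
        }
    }

In this example the order is cascaded, the note's reference is nullified, and the invoice does not block the delete because it references a golden record that is not being deleted.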

It is not necessary to configure in the job all the entities to which deletion should cascade. The job generation automatically detects the entities that must be included in the deletion, based on the entities managed by the job.