Data Certification

This document explains the jobs and the process involved in certifying data published into the hub.

Integration Job

The integration job processes the data submitted by an external load and runs this data through the certification process. This process is a series of steps to create and certify golden data from this source data.

This job is automatically generated using the data quality rules that you define in the model. It uses the tables created in the data hub when you deploy a model edition.

Although a good understanding of the process is not required to publish source data or consume golden data, it is necessary when you drill down into the various intermediate structures, for example to review the rejects or the duplicates detected by the integration job for a given golden record.

An integration job is a sequence of tasks used to certify golden data for a group of entities. The model edition deployed in the data location brings several integration job definitions with it. Each of these job definitions is designed to certify data for a group of entities.

Integration job definitions, as well as integration job logs, are stored in the repository.

For example, a multi-domain hub contains entities for the PARTY and PRODUCTS domains, and has two integration job definitions:

  • INTEGRATE_CUSTOMERS certifies data for entities such as Party and Location.

  • INTEGRATE_PRODUCTS certifies data for entities such as Brand, Product, and Part.

Integration jobs run when source data has been loaded in the landing tables and is submitted for golden data certification.

Each integration job is the implementation of the overall certification process template. It may contain all or some of the steps of this process.

The Certification Process

The Certification Process creates consolidated and certified Golden Records from various sources:

  • Source Records, pushed into the hub by middleware systems on behalf of upstream applications (known as the Publishers).
    Depending on the type of the entity, these records are either converted to golden records directly (basic entities) or matched and consolidated into golden records (ID and fuzzy matched entities). When matched and consolidated, these records are referred to as Master Records. The golden records that they contribute to are referred to as Master Based golden records.

  • Source Authoring Records, authored by users in the MDM applications.
    When users author data in an MDM application, depending on the entity type and the application design, they perform one of the following operations:

    • They create new golden records or update existing golden records that exist only in the hub, and do not exist in any of the publishers. These records are referred to as Data Entry Based golden records. This pattern is allowed for all entities, but basic entities support only this pattern.

    • They create or update master records on behalf of publishers, submitting these records to matching and consolidation. This pattern is allowed only for ID and fuzzy matched entities.

    • They override golden values resulting from the consolidation of records pushed by publishers. This pattern is allowed only for ID and fuzzy matched entities.

  • Delete Operations, made by users on golden and master records from entities with delete enabled.

  • Matching Decisions, taken by data stewards for fuzzy matched entities, using duplicates managers. Such decisions include confirming, merging, or splitting groups of matching records as well as accepting/rejecting suggestions.

The certification process takes these various sources and applies the rules and constraints defined in the model to create, update, or delete the golden data that business users browse using the MDM applications and that downstream applications consume from the hub.

This process is automated and involves several phases, automatically generated from the rules and constraints, which are defined in the model based on the functional knowledge of the entities and the publishers involved.

The following sections describe the details of the certification process for ID, fuzzy matched and basic entities, and the delete process for all entities.

Certification Process for ID and Fuzzy Matched Entities

The following figure describes the certification process for ID and Fuzzy Matched Entities as well as the tables involved in this process.

Figure 1. Certification process for matched entities

The certification process involves the following steps:

  1. Enrich and Standardize Source Data: Source Authoring Records (in the SA tables) created or updated on behalf of publishers and Source Records (in the SD tables) are enriched and standardized using the SemQL and API (Java Plug-in or REST Client) Enrichers, executed Pre-Consolidation.

  2. Validate Source Data: The enriched and standardized records are checked against the various Constraints executed Pre-Consolidation. Erroneous records are ignored for the rest of the processing and the errors are logged into the SE - source errors - and AE - authoring errors - tables.
    Note that source authoring records are enriched and validated only for basic entities. For ID and fuzzy matched entities, source authoring records are neither enriched nor validated.

  3. Match and Find Duplicates: For fuzzy matched entities, this step matches pairs of records using a Matcher and creates groups of matching records (match groups). For ID matched entities, matching is simply made on the ID value.
    The matcher works as follows:

    • It runs a set of Match Rules. Each rule has two phases: first, a binning phase creates small bins of records. Then, a matching phase compares each pair of records within these small bins to detect duplicates.

    • Each match rule has a Match Score that expresses how strongly the pair of records matches. A pair of records that match according to one or more rules is given the highest Match Score of all these rules. Match pairs with scores and rules are stored in the DU table.

    • When a match group is created, a Confidence Score is computed for that group. According to this score, the group is marked as a suggestion or immediately merged, and possibly confirmed. These automated actions are configured in the Merge Policy and Auto-Confirm Policy of the matcher.

    • Matching Decisions taken by users on match groups (stored in the UM table) are applied at that point, superseding the matcher’s choices.

  4. Consolidate Data: This step consolidates match group duplicates into single consolidated records. The Consolidation Rules created in the Survivorship Rules define how the attributes consolidate. Integration master records and integration golden (consolidated) records are stored at that stage in the MI and GI tables, respectively.

  5. Enrich Consolidated Data: The SemQL and API (Java Plug-in or REST Client) Enrichers executed Post-Consolidation run to standardize or add data to the consolidated records.

  6. Publish Certified Golden Data: This step finally publishes the Golden Records for consumption. The final master and golden records are stored in the MD and GD tables, respectively.

    • This step applies possible overrides from Source Authoring Records (involving the SA, SF, GA and GF tables), according to the Override Rules defined in the Survivorship Rules.

    • This step also creates or updates Data Entry Based golden records (that exist only in the MDM), from Source Authoring Records.

  7. Validate Golden Data: The quality of the golden records is checked against the various Constraints executed on golden records (Post-Consolidation). Note that unlike the pre-consolidation validation, it does not remove erroneous golden records from the flow but flags them as erroneous. The errors are also logged (in the GE tables).

  8. Historize Data: Golden and master data changes are added to their history (stored in the GH and MH tables) if historization is enabled.
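
The difference between the two validation steps above (steps 2 and 7) can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the certification job itself is generated from the model and runs against the hub tables, whereas this sketch uses in-memory lists. The constraint name NAME_MANDATORY, the has_errors flag, and both function names are hypothetical.

```python
# Hypothetical constraint: a name must not be empty.
constraints = {"NAME_MANDATORY": lambda r: bool(r.get("name"))}

def check(record, constraints):
    """Return the names of the constraints that the record violates."""
    return [name for name, rule in constraints.items() if not rule(record)]

def validate_source(records, constraints):
    """Pre-consolidation: erroneous records leave the flow, errors are logged."""
    kept, errors = [], []
    for rec in records:
        violations = check(rec, constraints)
        if violations:
            errors.append((rec["id"], violations))  # stands in for the SE/AE tables
        else:
            kept.append(rec)
    return kept, errors

def validate_golden(records, constraints):
    """Post-consolidation: erroneous records stay in the flow but are flagged."""
    flagged, errors = [], []
    for rec in records:
        violations = check(rec, constraints)
        if violations:
            errors.append((rec["id"], violations))  # stands in for the GE tables
        flagged.append({**rec, "has_errors": bool(violations)})
    return flagged, errors

rows = [{"id": 1, "name": "Acme"}, {"id": 2, "name": ""}]
kept, source_errors = validate_source(rows, constraints)
flagged, golden_errors = validate_golden(rows, constraints)
```

With the same two rows, validate_source keeps only the first record, while validate_golden returns both records and marks the second one as erroneous.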

Source Authoring Records are not enriched or validated for ID and fuzzy matched entities as part of the certification process. These records should be enriched and validated as part of the steppers into which users author the data.
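
The two phases of a match rule described in step 3 (binning, then pairwise matching) and the "highest Match Score wins" behavior can be sketched as follows. This is a minimal Python illustration, not the generated job: the record fields, the two rules (SAME_NAME_AND_ZIP, SAME_PHONE), their scores, and the function names are all assumptions made for the example, and the resulting dictionary merely stands in for the DU table.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical source records; field names are illustrative only.
records = [
    {"id": 1, "name": "Acme Corp", "zip": "10001", "phone": "555-0100"},
    {"id": 2, "name": "ACME Corp.", "zip": "10001", "phone": "555-0100"},
    {"id": 3, "name": "Acme Corp", "zip": "94105", "phone": "555-0199"},
    {"id": 4, "name": "Globex", "zip": "10001", "phone": "555-0142"},
]

def normalize(name):
    """Lowercase the name and keep only letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())

# Each hypothetical rule: (rule name, binning key, match condition, match score).
match_rules = [
    ("SAME_NAME_AND_ZIP",
     lambda r: r["zip"],
     lambda a, b: normalize(a["name"]) == normalize(b["name"]),
     90),
    ("SAME_PHONE",
     lambda r: r["phone"],
     lambda a, b: True,  # sharing the bin (same phone) is enough
     70),
]

def find_match_pairs(records, rules):
    """Return {(id_a, id_b): (best_score, [rule names])} for matching pairs."""
    pairs = {}
    for rule_name, bin_key, is_match, score in rules:
        # Binning phase: group records into small bins.
        bins = defaultdict(list)
        for rec in records:
            bins[bin_key(rec)].append(rec)
        # Matching phase: compare each pair of records within a bin.
        for members in bins.values():
            for a, b in combinations(members, 2):
                if is_match(a, b):
                    key = (a["id"], b["id"])
                    best, names = pairs.get(key, (0, []))
                    # A pair keeps the highest score of all rules that matched it.
                    pairs[key] = (max(best, score), names + [rule_name])
    return pairs

match_pairs = find_match_pairs(records, match_rules)
```

Binning keeps the number of comparisons small: records 1, 2 and 4 share a zip code and are compared with each other, while record 3 sits alone in its bin and is never compared under the first rule. Grouping pairs into match groups and computing the Confidence Score are left out of this sketch.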

Certification Process for Basic Entities

The following figure describes the certification process for Basic Entities as well as the tables involved in this process.

Figure 2. Certification process for basic entities

The certification process involves the following steps:

  1. Enrich and Standardize Source Data: During this step, the Source Records and Source Authoring Records (both stored in the SA tables) are enriched and standardized using SemQL and API (Java Plug-in or REST Client) Enrichers executed Pre-Consolidation.

  2. Validate Source Data: The quality of the enriched source data is checked against the various Constraints executed Pre-Consolidation. Erroneous records are ignored for the rest of the processing and the errors are logged (in the AE tables).

  3. Publish Certified Golden Data: This step finally publishes the Golden Records for consumption (in the GD tables).

  4. Historize Data: Golden data changes are added to their history (stored in the GH table) if historization is enabled.

Note that:

  • Basic entities do not separate Source Records from Source Authoring Records. Both follow the same process.

  • Source data for basic entities does not pass through enrichers or validations executed post-consolidation.
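
The four steps for basic entities can be sketched as a small Python pipeline. This is an illustration under assumptions, not the generated job: a dictionary and two lists stand in for the GD, GH and AE tables, the function name certify_basic, the enricher, and the NAME_MANDATORY constraint are hypothetical, and records are processed one at a time for brevity rather than step by step over the whole load.

```python
def certify_basic(source_records, enrichers, constraints, golden, history,
                  historize=True):
    """Run enrich, validate, publish, historize; return the logged errors."""
    errors = []  # stands in for the AE tables
    for rec in source_records:
        rec = dict(rec)  # work on a copy of the source record
        # 1. Enrich and Standardize Source Data
        for enrich in enrichers:
            rec = enrich(rec)
        # 2. Validate Source Data: erroneous records are ignored afterwards
        violations = [name for name, rule in constraints.items()
                      if not rule(rec)]
        if violations:
            errors.append((rec["id"], violations))
            continue
        # 3. Publish Certified Golden Data (stands in for the GD tables)
        golden[rec["id"]] = rec
        # 4. Historize Data (stands in for the GH table), if enabled
        if historize:
            history.append(dict(rec))
    return errors

# Hypothetical enricher and constraint.
enrichers = [lambda r: {**r, "name": r["name"].strip().upper()}]
constraints = {"NAME_MANDATORY": lambda r: bool(r["name"])}

golden, history = {}, []
errors = certify_basic(
    [{"id": 1, "name": " usd "}, {"id": 2, "name": "   "}],
    enrichers, constraints, golden, history,
)
```

In this run, record 1 is standardized to "USD", published, and historized, while record 2 ends up with an empty name after enrichment and is rejected with an error.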

Deletion Process

A Delete Operation (for basic, ID, or fuzzy matched entities) on a golden record involves the following steps:

  1. Propagate through Cascade: Extends the deletion to the child records directly or indirectly related to the deleted ones with a Cascade configuration for Delete Propagation.

  2. Propagate through Nullify: Nullifies child records related to the deleted ones with a Nullify configuration for the Delete Propagation.

  3. Compute Restrictions: Removes from deletion the records having related child records and a Restrict configuration for the Delete Propagation. If restrictions are found, the entire delete is canceled as a whole.

  4. Propagate Delete to Owned Master Records: Propagates the deletion to the master records attached to the deleted golden records. This step only applies to ID and fuzzy matched entities.

  5. Publish Deletion: Tracks record deletion (in the GX and MX tables), with the record values for soft deletes only, and then removes the records from the golden and master data (in the GD and MD tables).
    When doing a hard delete, this step deletes any trace of the records in every table (SA, SD, UM, MH, GH, etc.). The only trace of a hard delete is the ID (without data) of the deleted master and golden records, in the GX and MX tables.
    Deletes are tracked in the history for golden and master records (in the MH and GH tables), if historization is configured.

It is not necessary to configure in the job all the related entities that may be deleted by cascade. The job generation automatically detects the entities that must be included for deletion based on the entities managed by the job.
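
The first three steps of the deletion process (Cascade, Nullify, Restrict) can be sketched in Python as follows. This is a simplified illustration under assumptions: the references dictionary, the record identifiers, and the function name plan_delete are hypothetical, and steps 4 and 5 (master record propagation and publishing the deletion) are not covered.

```python
# Hypothetical child-to-parent references:
# child id -> (parent id, delete propagation configured on the reference).
references = {
    "contact-1": ("customer-1", "CASCADE"),
    "note-1": ("contact-1", "CASCADE"),
    "order-1": ("customer-1", "NULLIFY"),
    "invoice-1": ("customer-2", "RESTRICT"),
}

def plan_delete(references, requested):
    """Return (ids to delete, child ids to nullify), or None if canceled."""
    to_delete = set(requested)
    # 1. Propagate through Cascade: follow Cascade references, directly or
    #    indirectly, until no new child record is added.
    changed = True
    while changed:
        changed = False
        for child, (parent, mode) in references.items():
            if mode == "CASCADE" and parent in to_delete and child not in to_delete:
                to_delete.add(child)
                changed = True
    # 2. Propagate through Nullify: these children survive, but their
    #    reference to the deleted parent is set to null.
    to_nullify = {
        child for child, (parent, mode) in references.items()
        if mode == "NULLIFY" and parent in to_delete and child not in to_delete
    }
    # 3. Compute Restrictions: a remaining child behind a Restrict reference
    #    cancels the entire delete.
    restricted = any(
        mode == "RESTRICT" and parent in to_delete and child not in to_delete
        for child, (parent, mode) in references.items()
    )
    if restricted:
        return None
    return to_delete, to_nullify

plan_ok = plan_delete(references, {"customer-1"})
plan_canceled = plan_delete(references, {"customer-1", "customer-2"})
```

Deleting customer-1 alone cascades to contact-1 and then to note-1, and nullifies order-1. Adding customer-2 to the same request cancels the whole delete, because invoice-1 still references it through a Restrict configuration.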

Refer to data certification for more information about the model rules and artifacts involved in the generation of this process.