Integration Job

The integration job processes the data submitted in an external load and runs this data through a series of steps to create and certify golden data out of this source data.

This job is generated from the integration rules defined at design time, and it uses the data structures automatically created in the MDM hub when deploying the model edition.

Tip: Although understanding the details of the process is not needed for publishing source data or consuming golden data, it is necessary to have complete understanding of this process to drill down into the various structures between the source and the golden data. For example, to review the rejects or the duplicates detected by the integration job for a given golden record.

Certification Process Overview

Each integration job is the implementation of the overall certification process template. It may contain all or some of the phases of this process. The following section details the various phases of the certification process.

The following figure describes the Golden Data certification process and the various data structures involved in this process.

  1. Enrichment : This phase of the process enriches and standardizes source data using SemQL or plug-ins. This phase updates the data in the Source Data (SD) tables.
  2. Validation (Pre-Consolidation): This phase checks the enriched source data from the Source Data (SD) tables against the rules and constraints defined in the model, using SemQL or plug-ins. The records not conforming to the rules are rejected from the flow, and the errors (with the erroneous data) are tracked in a Source Error (SE) tables. Valid records are moved to the Master Integration (MI) tables.
  3. Matching: This phase creates matching groups according to the SemQL matching expression. Note that this phase runs in two steps. First, a binning step creates – using the binning expression – small sets of records into which the matching – using the SemQL matching condition – takes place. This phase involves the Master Integration (MI) tables where duplicate records are flagged.
  4. Consolidation: This phase creates (consolidates) a single record out of the various duplicates of a group, using the field-level or record-Level consolidation rules. This phase consolidates the data from the Master Integration (MI) tables in the Golden Integration (GI) tables.
  5. Validation (Post-Consolidation): This phase checks the Golden Integration (GI) - candidate golden – records against the rules and constraints defined in the model, using SemQL or plug-ins. The records not conforming to the rules are rejected from the flow, and the errors (with the erroneous data) are tracked in a Post Consolidation Error (or Golden Error - GE) tables.
  6. Merge Golden: Finally, Golden Integration (GI) records as well as the Master Integration (MI) records are merged in the Golden Data (GD) and Master Data (MD) tables.

For more information about these various phase and how the certification process is designed, refer to the "Integration Process Design" chapter in the "Semarchy Convergence for MDM Developer’s Guide".

Tables Structures

This section describes the general structure of the tables involved in the integration process.

Tables

For each entity, a pre-defined set of tables is created. These tables are named with the following convention: <Prefix>_<Physical Table Name>. The prefix is automatically generated depending on the table nature (Source Data, Source Error, etc.), and the Physical Table Name is set in the entity definition.

The following list describes the tables created for a given entity.

Table Name Description
SD_<Physical Table Name> Source Data This table is loaded by the middleware to publish data into the hub.
SE_<Physical Table Name> Source Errors This table contains the errors detected during the pre-consolidation validation phase.
MI_<Physical Table Name> Master Integration This table contains the records that have passed the pre-consolidation validation phase and that have been grouped by the matching phase.
GI_<Physical Table Name> Golden Integration This table contains the records that have been consolidated but not yet passed through the post-consolidation validation phase.
GE_<Physical Table Name> Golden Errors This table contents the errors detected during the post-consolidation validation phase.
MD_<Physical Table Name> Master Data This table contains the master records, which are the enriched and validated source records that have participated in the creation of golden records. This table can be accessed to review groups/duplicates or to refer to the source data.
GD_<Physical Table Name> Golden Data This table contains the golden records consolidated from the master records.

For publishing and consuming data, the MI and GI tables are not needed. The structure of these tables is not explained in this document.

Columns

Note About Date Values and Time Zones: Date values are stored, read and written in the time zone of the application server. When consuming dates and timestamp values, time zone conversion may be necessary if the client time zone differs from the application server time zone. Similarly, integration processes must convert dates and timestamp values in the application server’s time zone.

Attribute Columns

Attributes appear in the tables' structure as follows:

For example: The Country ( COUNTRY) and City ( CITY) definition attributes of the Geocoded Address complex attribute (Prefix: GEO) will be mapped to columns named GEOCOUNTRY and GEOCITY.

Built-in Columns

In addition to the attribute columns, built-in columns are added to the tables' structure, and are used in the integration process.
They track for example:

The following list describes these columns.

Column Name Datatype Exists in Tables Description
B_LOADID NUMBER(38,0) SD ID of the load that created this record.
B_BATCHID NUMBER(38,0) SD, SE, MD, GD, GE ID of the batch that created this record.
B_PUBID VARCHAR2(30 CHAR) SD, SD, MD Code of the publisher that published this record.
B_SOURCEID VARCHAR2(128 CHAR) SD, SD, MD ID of the source record in the source publisher system (Fuzzy Matching entities only). See below for a detailed explanation of the primary key options.
B_BRANCHID NUMBER(38,0) SD, SE, MD, GD, GE Data Branch containing the data edition(s) storing this record.
B_CLASSNAME VARCHAR2(128 CHAR) SD, SE, MD, GD, GE Class name of this record. See below for a detailed explanation of class name and inheritance.
B_CONSTRAINTNAME VARCHAR2(128 CHAR) SE, GE Name of the constraint causing this error (Error Records)
B_CONSTRAINTTYPE VARCHAR2(30 CHAR) SE, GE Type of the constraint causing this error (Error Records)
B_FROMEDITION NUMBER(38,0) MD, GD Edition from which the Master/Golden record exists.
B_TOEDITION NUMBER(38,0) MD, GD Edition until which the Master/Golden record exists. If it is null, then the record exists until the latest data edition.
B_MATCHGRP NUMBER(38,0) MD ID of the match group for the master record. This column is set when matching takes place.
B_MASTERSCOUNT NUMBER(38,0) GD Number of master records contributing to this golden record.
B_IS_VALIDATED VARCHAR2(1) GD Flag indicated whether this golden record has been validated and the matching confirmed by a user (Fuzzy Matching entities only).
B_ORIGINALBATCHID NUMBER(38,0) SD Batch identifier of the record when it was originally checked out in a workflow.
B_CREDATE TIMESTAMP(6) SD, SE, MD, GD, GE Creation date of the record.
B_UPDDATE TIMESTAMP(6) SD, SE, MD, GD, GE Last update date of the record.
B_CREATOR VARCHAR2(128 CHAR) SD, SE, MD, GD, GE Creator of the record.
B_UPDATOR VARCHAR2(128 CHAR) SD, SE, MD, GD, GE Last updater of this record.

Primary Keys Columns

The primary key to load depends on the type of the matching behavior of the entity:

ID Matching

When using ID Matching, we assume a common identifier across all systems. In this case, this common identifier is stored in a column named after the Physical Column Name specified in the primary key attribute definition.

This column will exist in all tables. When publishing data into the hub, the middleware loads this column with the primary key from the publishing system. This identifier is simply propagated into the hub, and matching is done using this primary key.

Fuzzy Matching

When using Fuzzy Matching, we assume no common identifier across publishers. There may be two different records with the same ID in different systems. There is a need to match the records and consolidate them under a golden record having a primary key generated by the system.

In this case, the source identifier is stored in the column named B_SOURCEID, along with the publisher code stored in the B_PUBID. This is the case for the SD, SE and MD tables.

When publishing data into the hub, the middleware loads this B_SOURCEID column with a primary key value from the publishing system. If this primary key is a composite key in the source system, all the columns of this composite key must be concatenated into B_SOURCEID.

When the records are consolidated in a golden record (GD and GE tables), a System Defined Primary Key is generated and stored in a column named after the Physical Column Name specified in the primary key attribute definition. This key is referred to as the Golden Record ID.

The MD table makes the bridge between the Source ID and the Golden ID as it contains both these values.

Reference Columns

When a reference exists in the source publisher and need to be expressed in the landing table, this reference mapping in the table structure depends on the matching behavior of the referenced entity.

Reference to an ID Matching Entity

For a reference to an ID Matching entity, the referenced key is the same for all systems. As a consequence, the referenced value is stored in a single column. This column is named after the Physical Name provided in the reference definition and is prefixed with F_. For example, F_EMPLOYEE.

For example: if Customer references Employee and this entity uses ID Matching, SD_CUSTOMER will contain the following information.

CUSTOMER_NAME F_EMPLOYEE
Gadgetron 11
Roxxon 56

This information means that the customer “Gadgetron” references the employee number 11, and “Roxxon” references employee number 56. This employee number is the same in all systems.

Reference to a Fuzzy Matching Entity

For a reference to a Fuzzy Matching entity, the referenced key is per publisher.

As a consequence:

For example: if Contact references Customer and this entity uses Fuzzy Matching, SD_CONTACT will contain the following information.

FIRST_NAME LAST_NAME FP_CUSTOMER FS_CUSTOMER
John Doe CRM 1235
Jane Smith MKT A3251

This information means that the contact “John Doe” references the customer with the primary key “1235” in the “CRM” publisher, and that “Jane Smith” references the customer with the primary key “A3251” in the “MKT” publisher.

Class Name and Inheritance

Several entities involved in an inheritance relation have their data stored in the same set of tables. These tables store the superset of the attributes of the parent and all its child entities. The B_CLASSNAME column is used to identify the class (entity) of a record in a table.

For example, when Person and Company inherit from the Party entity, the resulting table is named after the parent entity ( Party), and will contain all the attributes of Person and Company as well. In the GD_PARTY table for example, records representing persons will have B_CLASSNAME='Person'. When publishing person or company information in the SD_PARTY table, the middleware must set B_CLASSNAME='Person' or B_CLASSNAME='Company' accordingly.

Constraints

The only constraints enforced in the tables are the primary keys and the not null columns. These constraints apply to system columns only.
For example, on an SD_ table, the following constraints are enforced:

Other constraints defined in the model (mandatory attributes, references, etc.) are not enforced in the physical model but checked during the validation phase.