Integration Process Design

Introduction to the Integration Process

The Integration Process transforms Source Records pushed into the hub by several Publishers into consolidated and certified Golden Records. This automated process involves several phases, generated from the rules and constraints defined in the model. These rules and constraints capture the functional knowledge of the entities and the publishers involved.

Integration Process Overview

The integration process involves the following steps:

  1. Enrichment: During this step, the source data is enriched and standardized using SemQL and Plug-in Enrichers.
  2. Pre-Consolidation Validation: The quality of the enriched data is checked against the various Constraints to be executed Pre-Consolidation.
  3. Matching: This process runs in two phases: a binning phase first creates small groups of records, then a matching phase compares records within these smaller bins to detect duplicates.
  4. Consolidation: This process consolidates the duplicates detected in the matching phase into a single record. It performs field-level or record-level consolidation.
  5. Post-Consolidation Validation: This process is similar to the pre-consolidation validation, but is executed on the consolidated records.

Rules Involved in the Process

The rules involved in the process include:

Integration Jobs

When all the rules are defined, one or more Integration Jobs can be defined for the model.
The integration job runs to perform the integration process, using the hub’s database engine for most of the processing (including SemQL processing) and Semarchy Convergence for MDM for running the plug-in code.

Integration jobs are triggered to integrate data published in batch by data integration/ETL tools, or to process data handled by users in human workflows.

When pushing data into the hub, a data integration or ETL product performs the following:

  1. It requests a Load ID to identify the data load and initiate a transaction with Semarchy Convergence for MDM.
  2. It loads data in the landing tables of Semarchy Convergence for MDM, possibly from several sources identified as Publishers.
  3. It submits the load identified by the Load ID, and when submitting the load, it provides the name of the Integration Job that must be executed to process the incoming data.

Similarly, when a user starts a human workflow for data entry and duplicate management:

  1. A transaction is created and attached to the workflow instance and is identified by a Load ID.
  2. The user performs the data entry and duplicate management operations in the graphical user interface. All the data manipulations are performed within the transaction.
  3. When the activity is finished, the transaction is submitted. This triggers the Integration Job specified in the workflow definition.

Human Workflows

Human Workflows allow users to perform data entry or duplicate management operations in the MDM hub:

When a human workflow completes, an integration job is triggered to process the data entered or the duplicate management choices made by the user.

Publishers

Publishers are applications and users that provide source data to the MDM Hub. They identify themselves using a code when pushing batches of data.
The publisher does not represent the technical provider of the data (the ETL or Data Integration product), but the source of the data (the CRM application, the Sales Management system, etc.). Examples of publishers: CRM, Sales, Marketing, Finance, etc.

Consolidation makes certain choices depending on the publishers, and the publishers are tracked to identify the origin of the golden data certified by Semarchy Convergence for MDM.

Note: Clearly identifying and declaring the publishers is an important part of designing the integration process. Make sure to identify the publishers when starting an MDM project.

Important: As a general rule, use a dedicated publisher for data entry operations. Such a publisher can be used as the preferred publisher in all consolidation rules to give user-entered data precedence over application-provided data.

To create a publisher:

  1. Right-click the Publishers node and select Add Publisher.... The Create New Publisher wizard opens.
  2. In the Create New Publisher wizard, check the Auto Fill option and then enter the following values:
  3. Active: Check this box to make this publisher active. An inactive publisher is simply declared but not used in the consolidation rules.
  4. Click Finish to close the wizard. The Publisher editor opens.
  5. In the Description field, optionally enter a description for the Publisher.
  6. Press CTRL+S to save the editor.
  7. Close the editor.

Enrichment

Enrichers normalize, standardize and enrich the source data (attribute values) pushed into the hub by the Publishers.

Enrichers have the following characteristics:

There are two types of enrichers:

Creating SemQL Enrichers

A SemQL Enricher enriches several attributes of an entity using other attributes of the same entity, transformed with SemQL expressions and functions.
You will find SemQL examples for enrichers in the Introduction to the Semarchy Workbench chapter.
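
As a minimal sketch (assuming the entity has a CustomerName attribute, and that standard database functions such as UPPER and TRIM are available through SemQL), an enricher expression standardizing the customer name could be:

UPPER(TRIM(CustomerName))

This expression would be set on the CustomerName attribute in the Enricher Expressions table described below.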

To create a SemQL enricher:

  1. Expand the entity node, right-click the Enrichers node and select Add SemQL Enricher.... The Create New SemQL Enricher wizard opens.
  2. In the Create New SemQL Enricher wizard, check the Auto Fill option and then enter the following values:
  3. Click Next.
  4. In the Enricher Expressions page, select the Available Attributes that you want to enrich and click the Add >> button to add them to the Used Attributes.
  5. Click Next.
  6. Optionally click the Edit Expression button to open the expression editor and define a filter. The enricher will only enrich the records matching this filter. Skip this task if you want to enrich all the records.
  7. Click Finish to close the wizard. The SemQL Enricher editor opens.
  8. In the Description field, optionally enter a description for the SemQL Enricher.
  9. Set the enricher expressions:
    1. In the Enricher Expressions table, select the Expression column for the attribute that you want to enrich and then click the Edit Expression button. The SemQL editor opens.
    2. Create a SemQL expression to load the attribute to enrich, and then click OK to close the SemQL Editor. This expression may use any attribute of the current entity.
    3. Repeat the previous steps to set an expression for each attribute to enrich.
  10. Press CTRL+S to save the editor.
  11. Close the editor.

Creating Plug-in Enrichers

A Plug-in Enricher enriches several attributes of an entity using attributes from this entity, transformed by a plug-in developed in Java.
A plug-in enricher takes a list of Plug-in Inputs, fed from the entity attributes using SemQL expressions, as well as Plug-in Params that configure its behavior.

It returns a list of Plug-in Outputs, which must be mapped on the entity attributes.

Attributes are mapped on the inputs to feed the plug-in, and on the outputs to enrich the entity with the resulting data transformed by the plug-in.
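
As an indicative sketch only (assuming an address-standardization plug-in and that the SQL concatenation operator || is available through SemQL), a plug-in input could be fed with an expression built on the InputAddress complex attribute used in the matching example later in this chapter:

InputAddress.Address || ' ' || InputAddress.City

The plug-in outputs would then be bound to the attributes to enrich, for example the fields of the GeocodedAddress complex attribute.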

Note: Before using a plug-in enricher, make sure the plug-in was added to the platform by the administrator. For more information, refer to the "Semarchy Convergence for MDM Administration Guide".

To create a plug-in enricher:

  1. Expand the entity node, right-click the Enrichers node and select Add Plug-in Enricher.... The Create New Plug-in Enricher wizard opens.
  2. In the Create New Plug-in Enricher wizard, check the Auto Fill option and then enter the following values:
  3. Click Next.
  4. Optionally click the Edit Expression button to open the expression editor and define a filter. The enricher will only enrich the records matching this filter. Skip this task if you want to enrich all the records.
  5. Click Finish to close the wizard. The Plug-in Enricher editor opens. The Plug-in Params, Plug-in Inputs and Plug-in Outputs tables show the parameters and inputs/outputs for this plug-in.
  6. You can optionally add parameters to the Plug-in Params list:
    1. In the Plug-in Params table, click the Define Plug-in Parameters button.
    2. In the Parameters dialog, select the Available Parameters that you want to add and click the Add >> button to add them to the Used Parameters.
    3. Click Finish to close the dialog.
  7. Set the values for the parameters:
    1. Click the Value column in the Plug-in Params table in front of a parameter. The cell becomes editable.
    2. Enter the value of the parameter in the cell, and then press Enter.
    3. Repeat the previous steps to set the values for the other parameters.
  8. You can optionally add inputs to the Plug-in Inputs list:
    1. In the Plug-in Inputs table, click the Define Plug-in Inputs button.
    2. In the Input Bindings dialog, select the Available Inputs that you want to add and click the Add >> button to add them to the Used Inputs.
    3. Click Finish to close the dialog.
  9. Set the values for the inputs:
    1. Click the Expression column in the Plug-in Inputs table in front of an input and then click the Edit Expression button. The SemQL editor opens.
    2. Edit the SemQL expression using the attributes to feed the plug-in input, and then click OK to close the SemQL Editor.
    3. Repeat the previous steps to set an expression for each input.
  10. Select the attributes to bind to the Plug-in Outputs:
    1. In the Plug-in Outputs table, click the Define Plug-in Outputs button.
    2. In the Output Bindings dialog, select the Available Attributes that you want to enrich and click the Add >> button to add them to the Attributes Used.
    3. Click Finish to close the dialog.
  11. For each attribute in the Plug-in Outputs table, select in the Output Name column the plug-in output that you want to use to enrich the attribute shown in the Attribute Name column.
  12. Press CTRL+S to save the editor.
  13. Close the editor.

Pre and Post-Consolidation Validation

Several validations can be defined per Entity. Validations check attribute values and reject invalid records. All validations are executed on each record.
Validations can take place (and/or):

Note: Unique Keys are only checked post-consolidation, as they only make sense on consolidated records.

Pre vs. Post Validation

Pre-Consolidation Validation is done on the data pushed by all the publishers for this entity, after enrichment.
Post-Consolidation Validation is done on the de-duplicated and consolidated data.

Choosing whether a validation runs pre- and/or post-consolidation has an impact on the behavior of the integration hub.
The following examples illustrate the impact of this choice.

Example #1:

In this example, CheckNullRevenue should be done Post-Consolidation to avoid rejecting information required later in the integration process.
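
As a sketch, assuming the entity carries a Revenue attribute, such a validation could be expressed as a SemQL condition like:

Revenue is not null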

Example #2:

In this example, IsValidGeocodedAddress should be done Pre-consolidation to avoid the performance cost of matching records with addresses not meeting the entity requirements.
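
For illustration only (the actual condition depends on how the address is enriched), such a check could be a SemQL condition on the GeocodedAddress complex field, for example:

GeocodedAddress.Country is not null and GeocodedAddress.Region is not null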

Matching

The matching phase detects the duplicates in order to consolidate them into a single golden record.

Matching works differently for Fuzzy Matching and ID Matching entities.

Note: You can define a matcher on ID Matching Entities. It is used for the sole purpose of detecting duplicates when creating new records in a data entry workflow. Such a matcher interactively warns the user when a new entry matches existing records.

Matcher

A Matcher is a two-phase process:

The Binning Phase

The binning phase divides the source records into bins so that the matching phase only runs within a given bin. As the matching phase can be resource-consuming, reducing the number of comparisons is important for high-performance matching.

Examples:

Binning is done using several expressions defined in the matcher. Records for which all the binning expressions return the same results belong to the same bin.
For example, to bin Customers by the Country and by the first letter of the Region in the GeocodedAddress complex field, we would use two binning expressions.
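
As an indicative sketch (assuming the SUBSTR function is available through SemQL), these two expressions could be:

GeocodedAddress.Country
SUBSTR(GeocodedAddress.Region, 1, 1)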

Warning: Smaller bins mean faster processing, but you must make sure that binning does not exclude possible matches.
For example, binning Customers by the first four letters of the last name will put “Mr. Bill Jones-Smith” and “Mr. Bill Jonnes-Smith” into different bins. These two duplicates caused by a typo will never be matched. For this specific case, you may consider a different attribute, or a SOUNDEX on the name.
It is recommended to use very accurate fields for binning, such as CountryName, ZipCode, etc.
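
For that specific case, a more typo-tolerant binning expression could be, as a sketch (reusing the CustomerName attribute and the SOUNDEX function shown in the matching example below):

SOUNDEX(CustomerName)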

The Matching Phase

The matching phase uses an expression that compares two records.
This expression uses two pseudo records named Record1 and Record2 corresponding to the two records being matched. If this expression is true, then the two records are considered as matched.

For example, the following matching expression matches customers whose names sound alike or are more than 80% similar, and whose addresses and cities are each more than 65% similar:

(SOUNDEX(Record1.CustomerName) = SOUNDEX(Record2.CustomerName) OR 
SEM_EDIT_DISTANCE_SIMILARITY(Record1.CustomerName,Record2.CustomerName)>80)
and SEM_EDIT_DISTANCE_SIMILARITY(Record1.InputAddress.Address, Record2.InputAddress.Address) > 65
and SEM_EDIT_DISTANCE_SIMILARITY( Record1.InputAddress.City, Record2.InputAddress.City ) > 65

Creating a Matcher

Note: Only one matcher can be created for each entity.

To create a matcher:

  1. Expand the entity node, right-click the Matcher node and select Define SemQL Matcher.... The Create New SemQL Matcher wizard opens.
  2. In the Description field, optionally enter a description for the Matcher.
  3. Click Finish to close the wizard. The SemQL Matcher editor opens.
  4. Define the Binning Expressions:
    1. In the Binning Expressions table, click the Add Binning Expression button. The SemQL editor opens.
    2. Create a SemQL expression used to bin records for this entity, and then click OK to close the SemQL Editor. This expression may use any attribute of the current entity.
    3. Repeat the previous steps to create all your binning expressions.
  5. Define the Matching Expression:
    1. In the Matching Expression section, click the Edit Expression button. The SemQL editor opens.
    2. Create a SemQL expression used to match records for this entity, and then click OK to close the SemQL Editor. This expression may use any attribute of the current entity.
  6. Press CTRL+S to save the editor.
  7. Close the editor.

Consolidation

Consolidation merges fields from all the duplicates detected into a single golden record. It is defined in the Consolidator of the entity.

Consolidation Type

Consolidation uses one of the following methods:

Consolidation Strategies

A consolidation strategy defines how to choose the best record or field value in the consolidation process. The available consolidation strategies differ depending on the consolidation method.

Record Level Consolidation

Record Level Consolidation supports the following strategies:

Record level consolidation uses an Additional Order By option. This option is a SemQL expression used to sort records in the event of an ambiguity after the first strategy, for example when two records are duplicates from the same publisher and the Preferred Source strategy is used.

Field Level Consolidation

With this method, a different strategy can be selected for each attribute.

Field Level Consolidation supports the following strategies:

A global Additional Order By option stores a SemQL expression used to sort records in the event of an ambiguity after the first strategy, for example when duplicate records from the same publisher provide different values for a field and a Preferred Source strategy is used. The expression is an order by clause and can include the ascending (ASC) or descending (DESC) specification. Note that the additional order by clause is not supported with the Most Frequent Value consolidation strategy.
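
For instance (a sketch assuming a hypothetical LastUpdateDate attribute on the source records), an Additional Order By expression could be:

LastUpdateDate DESC

This would break the tie by preferring the value from the most recently updated record.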

Creating a Consolidator

To create a record-level consolidator:

  1. Expand the entity node, right-click the Consolidator node and select Define Consolidator.... The Create New Consolidator wizard opens.
  2. In the Create New Consolidator wizard, select Record Level Consolidation in the Consolidator Type.
  3. In the Description field, optionally enter a description for the Consolidator.
  4. Click Finish to close the wizard. The Consolidator editor opens.
  5. In the Record Level Consolidation Strategy field, select the consolidation strategy.
  6. Set the parameters depending on the selected strategy.
  7. Set the Additional Order By expression.
    1. On the Additional Order By field, click the Edit Expression button. The SemQL editor opens.
    2. Create a SemQL expression used to handle consolidation disambiguation, and then click OK to close the SemQL Editor.
  8. Press CTRL+S to save the editor.
  9. Close the editor.

To create a field-level consolidator:

  1. Expand the entity node, right-click the Consolidator node and select Define Consolidator.... The Create New Consolidator wizard opens.
  2. In the Create New Consolidator wizard, select Field Level Consolidation in the Consolidator Type.
  3. In the Description field, optionally enter a description for the Consolidator.
  4. Click Finish to close the wizard. The Consolidator editor opens. All the fields appear in the Field Level Consolidators table and are defined with the Any Value strategy.
  5. To modify the consolidation strategy for a field:
    1. Double-click the Attribute Name in the Field Level Consolidators table.
    2. In the Define Field-Level Consolidator wizard, select a consolidation strategy.
    3. Set the parameters depending on the selected strategy:
  6. Set the Additional Order By expression.
    1. On the Additional Order By field, click the Edit Expression button. The SemQL editor opens.
    2. Create a SemQL expression used to handle consolidation disambiguation, and then click OK to close the SemQL Editor.
  7. Press CTRL+S to save the editor.
  8. Close the editor.

Creating Integration Jobs

An Integration Job is a job executed by Semarchy Convergence for MDM to integrate and certify source data into golden records. This job uses the rules defined as part of the integration process, and contains a sequence of Tasks running these rules. Each task addresses one entity, and performs several processes (Enrichment, Validation, etc.) for this entity.

Creating Jobs

To create a job:

  1. Right-click the Jobs node and select Add Job.... The Create New Job wizard opens.
  2. In the Create New Job wizard, check the Auto Fill option and then enter the following values:
  3. Click Next.
  4. In the Tasks page, select the Available Entities that you want to process in this job and click the Add >> button to add them to the Selected Entities.
  5. Click Finish to close the wizard. The Job editor opens.
  6. Select Tasks in the editor sidebar. In the list of Tasks, the entities involved in each task are listed, as well as the processes (Enrichers, Matchers, etc.) that will run for these entities.
  7. Use the Move Up and Move Down buttons to order the tasks.
  8. To edit the processes involved in one task:
    1. Double-click the entity Name in the Tasks table. The editor switches to the Task editor.
    2. Select the process that you want to enable for this task.
    3. Use the editor breadcrumb to go back to the Job editor.
  9. Press CTRL+S to save the editor.
  10. Close the editor.

Job Parameters

Jobs can be parameterized to optimize their execution.

To change a job parameter:

  1. In the job editor, select Job Parameters in the editor sidebar.
  2. In the Job Parameters table, click the Add Parameter button. The Create New Job Parameter wizard opens.
  3. In the Name field, enter the name of the parameter.
  4. In the Value field, enter the value for this parameter.
  5. Click Finish to close the wizard.
  6. Press CTRL+S to save the editor.
  7. Close the editor.

The following table lists the parameters available to customize the jobs.

Parameter Name | Values | Description
PARAM_RECYCLE_ERRORS | 0 or 1 | If this parameter is set to 1, error recycling is triggered and rejects from previous job executions are recycled in this job.
PARAM_ANALYZE_STATS | 0 or 1 | If this parameter is set to 1, statistics collection is triggered on the MDM hub tables to optimize processing. This option is useful for accelerating the processing of large data sets.
PARAM_AGGREGATE_JOB_ENRICHERS | 0 or 1 | If this parameter is set to 1, consecutive SemQL enrichers are merged into a single SQL statement when executed. This applies to all entities.
PARAM_AGGREGATE_ENTITY_ENRICHERS_<entity_name> | 0 or 1 | If this parameter is set to 1, consecutive SemQL enrichers are merged into a single SQL statement when executed. This applies only to the entity whose name is provided as <entity_name>.

Jobs Sequencing and Parallelism

A job is a sequence of tasks. These tasks must be ordered to handle referential integrity. For example, if you run all the tasks on Contact before Customer, new contacts attached to new customers will likely not be integrated because the new golden customers are not created yet.

Jobs themselves are executed sequentially within defined Queues, in FIFO (First-In First-Out) mode.
If two jobs can run simultaneously, they should be placed in different queues. For example, if two jobs address two different areas of the same model, these jobs can run simultaneously in different queues.

Designing Integration Jobs

It is recommended to create jobs specific to the data loads performed by the data integration batches, and dedicated jobs for human workflows.

Integration Jobs for Data Integration

Data published in batch may target several entities.

It is recommended to define jobs specific to the data loads targeting the hub:

Integration Jobs for Data Entry Workflows

A data entry workflow uses a Business Object composed of several entities.

It is recommended to define a specific job for each data entry workflow:

Integration Jobs for Duplicate Management Workflows

A duplicate management workflow handles duplicates detected on a specific entity.

It is recommended to define a specific job for each duplicate management workflow: