In this tutorial, you will learn how to create robust match rules and how to create enrichers to facilitate these match rules. You will also define some custom survivorship rules. These activities are essential to ensure that data coming from the different publishers is matched and consolidated optimally.

What you'll learn

Before you start

This tutorial is the second unit of the Data Consolidation track. If not done yet, you must follow the first unit, Create your First Data Consolidation Application.

The estimated duration of this unit is about 1 hour and 40 minutes.

Do your homework!

In this tutorial, we give you the rules to perform some useful matching on Business-to-Business (B2B) data.

You will learn all the mechanics to write the match rules for your data requirements.

In a real Data Management project, you will need to do some homework! Consider the following:

With that in mind, let's get started.

Semarchy xDM allows you to create enrichers to normalize, standardize, and enrich data loaded into or authored within your xDM application. This is a key success factor for matching as it allows matching on cleaner, richer, and more standardized data.

In this first step, you will be guided to create a SemQL enricher to standardize company names.

What you'll learn

Add the Normalized Name attribute

Before setting up an enricher, you first need to add an attribute to the Company entity to store standardized company names.

  1. From the Designer perspective of the Application Builder, expand Entities>Company>Attributes.
  2. Right-click on Attributes and select Add Simple Attribute.

  1. Enter NormalizedName as the Name, keep the other default values, and then click Finish.

  1. Save the model.

Create an enricher

Let's now add a SemQL enricher to perform basic data standardization on CompanyName. The goal is to remove punctuation.

  1. Under the Company entity, right-click on Enrichers and select Add SemQL Enricher:

  1. Enter RemovePunctuation as the Name and click Next.

  1. Define the attribute to enrich: select the NormalizedName attribute, click the Add>> button, then Finish.

  1. In the Enricher Expressions section, click the Edit Expression button next to the NormalizedName attribute.

  1. Copy and paste the SemQL expression for your database (or build it by double-clicking the REGEXP_REPLACE function in the Functions list and setting the parameters), and then click OK.
REGEXP_REPLACE( CompanyName, '[[:punct:]]', ' ', 'g' )
REGEXP_REPLACE( CompanyName, '[[:punct:]]', ' ' )

  1. Click Save to save the enricher.

Add the new attributes to the Company Collection

  1. Expand the Collections node under the Company entity and double-click on CompanyCollection:

Deploy the model changes

You first need to deploy the changes you just performed on your model.

  1. Go to the Management perspective.
  2. Right-click on B2BTutorial and select Deploy Model Edition.

  1. Click Next.

  1. Click Finish to deploy the listed changes to the Data Location:

  1. Wait until all changes are deployed.

Reload data

Reloading data is required to execute the enricher on all existing data. In this section, you will delete and load data again using the xDM user interface. If you're loading from a data integration tool, you may use it to truncate and reload data.

  1. Return to the Welcome Page and click on B2BTutorial.

  1. Click on Companies (Golden) in the navigation drawer.
  2. Click the "Select all" checkbox in the collection header to select all records, open the action menu, and then click Delete:

  1. Click on Companies (Master) in the navigation drawer, open the action menu of the Companies collection, and then select Import Master Records.

  1. From the tutorial resources, select the
    data-consolidation\datasets\companies-data.xlsx file.file from the tutorial resources.
  2. Click CONTINUE:

  1. Click CONTINUE on the next screen to confirm the mapping.
  2. Click FINISH on the Summary page:

  1. Review the imported data and click FINISH.

  1. When the toaster indicates "Changes successfully applied", click on CLICK TO REFRESH to see the data you just loaded:

  1. See that the Normalized Name column is now populated: for instance, the Normalized Name for "FreightWise, Inc." is "FreightWise Inc".

  1. Click on Companies (Golden) in the navigation drawer. The Normalized Name is also available on golden records..

Improve company name standardization

Your enricher misses one important special case: the character "&" is treated like any other, but has a meaning you want to preserve.

For instance, we would like "Ben and Jerry's" and "Ben & Jerrys" to be treated as a strong match and to be merged accordingly, possibly without human confirmation. This is a very common special case in match rules.

That's why you will update the enricher to replace "&" with "AND" before removing any punctuation.

  1. Click on your avatar picture on the top-right corner of the screen to open the user menu, and select All applications.

  1. Open the Application Builder and select the B2BTutorial [0.0] model.
  2. Expand Entities>Company>Enrichers, and double-click on RemovePunctuation.

  1. Click the Edit expression button next to the enricher's expression:

  1. Replace the existing expression with the following one (depending on your database) and click OK. Observe that this expression will change company names to uppercase, remove punctuation and replace "&" with "AND".
REGEXP_REPLACE( 
        REPLACE( UPPER(CompanyName), '&', ' AND ' ),
        '[[:punct:][:blank:]]+', /* find one or more spaces or punctuation marks */
        ' ',                     /* replace with a single space */
        'g'                      /* replace all occurrences */
)
REGEXP_REPLACE( 
        REPLACE( UPPER(CompanyName), '&', ' AND ' ),
        '[[:punct:][:blank:]]+', /* find one or more spaces or punctuation marks */
        ' '                      /* replace with a single space */
)

  1. Save your work.

You will observe the benefits of this enhancement when you reload data in the next step.

Congratulations

You have successfully completed the first step of this tutorial by adding a first enricher.

What we've covered

In the next step, you will add another enricher for name phonetization.

In this step, you will add another enricher to phoneticize names and help matching records.

What you'll learn

Add the PhoneticName attribute

Let's start by adding an attribute to store the phonetized name. Proceed the same way you did for NormalizedName.

  1. Right-click on the Attributes node under the Company entity and select Add Simple Attribute.

  1. Enter PhoneticName as the Name, keep the other default parameters, and then click Finish.

  1. Save the model.

Add the phonetization enricher

You will now use the Semarchy xDM Text Enricher Plug-in to phoneticize company names. This is an extremely useful technique depending on where your source errors come from.

  1. Under the Entities>Company node, right-click on Enrichers and select Add API Enricher.

  1. Enter the following values in the wizard and then click Finish:

  1. Configure the transformation: scroll down to the Plug-in Params section. Paste the following value in the Transformation parameter:
DOUBLEMETAPHONE 12 TRUE

  1. In the Plug-in Inputs section, click the Edit expression button for the Input Text row.

  1. Double-click on Phonetic Name in the Attributes list to add it to the expression and then click OK.

  1. In the Outputs section, click Define Outputs:

  1. Add the PhoneticName attribute and select Transformed Text as the Output Name:

  1. Save your work and validate the model. No error or warning should be reported..

Deploy the model changes

Repeat the same steps to deploy the model changes:

  1. Go to the Management perspective.
  2. Right-click on B2BTutorial and select Deploy Model Edition.

  1. Click Next.

  1. Click Finish to deploy the changes to the Data Location.

Reload data

You now need to reload data to see the improved name standardization and the name phonetization. Repeat the same operations as previously:

  1. Return to the welcome page and access the B2BTutorial application again.
  2. Click on Companies (Golden) in the navigation drawer.
  3. Select the "Select all" checkbox in the collection header to select all records, open the action menu, and click on Delete:

  1. Select Companies (Master) in the navigation drawer, open the actions menu of the Companies collection, and then select Import Master Records.

  1. Select the data-consolidation\datasets\companies-data.xlsx file from the tutorial resources.
  2. Confirm the import options, mapping, and validate the import summary.
  3. Review the imported data and click FINISH.

  1. When the toaster indicates "Changes successfully applied", click on CLICK TO REFRESH:

  1. Click on Companies (Golden) in the navigation drawer to see the enriched golden data. Observe that:

  1. Click on the Company column header to sort results alphabetically:

  1. Browse the records and identify duplicates that could be detected by match rules using the Normalized Name or the Phonetic Name attribute:

Congratulations

You have successfully completed the second step of this tutorial by adding a Plug-in enricher.

What we've covered

The next step will focus on updating match rules to take advantage of these enriched fields.

Semarchy xDM provides a powerful mechanism to define multiple match rules with different match scores, and merge policies to define what happens to clusters of potential matches as they become golden records.

Let's see how it works.

What you'll learn

Multiple match rules and scores

To improve the matching, we want to apply several rules and associate a different confidence score to each of them:

Add match rules

Let's now add match rules that will leverage the normalized and phonetic names you added previously.

Add a match rule for normalized name

  1. Open the existing SemQL matcher that you initialized during the previous tutorial unit Create Your First Data Authoring Application: In the Application Builder, expand Entities>Company>Matcher, and then double-click on SemQLMatcher - Company.

  1. Click Add Match Rule on top of the Match Rules list:

  1. Enter the following parameters:

  1. Scroll down to the Matching section.

  1. Paste the following expression in the SemQL editor and click OK:
Record1.NormalizedName = Record2.NormalizedName 
AND ( 
  Record1.PostalCode = Record2.PostalCode
  OR Record1.PostalCode IS NULL OR Record2.PostalCode IS NULL
)
AND Record1.Country = Record2.Country

6. Click Save.

Add a match rule for phonetic name

Add a rule for the PhoneticName attribute by following the same procedure:

  1. Click on SemQLMatcher - Company in the breadcrumb to return to the matcher's configuration:

  1. Click Add Match Rule on top of the Match Rules list.

  1. Enter the following values:
Record1.PhoneticName = Record2.PhoneticName 
AND ( 
  Record1.PostalCode = Record2.PostalCode
  OR Record1.PostalCode IS NULL OR Record2.PostalCode IS NULL
)
AND Record1.Country = Record2.Country

  1. Save the rule.
  2. Click again on SemQLMatcher - Company in the breadcrumb to return to the matcher's configuration. Review your matcher, now composed of 3 rules:

Configure thresholds

Now that new match rules have been defined to identify potential matches, we don't want all matches to merge automatically anymore. Let's modify the thresholds for the Company SemQLMatcher accordingly:

  1. Within the SemQLMatcher editor, scroll down to the Merge Policy section.
  2. Enter the following thresholds to define when records will be merged automatically:

  1. Within the Auto-Confirm Policy section, set the Auto-confirm golden records property to 91. This means that only golden records with a confidence score of 91 or higher will be confirmed and merged automatically.

Deploy, reload and test

  1. Save your work and validate the model.

  1. Deploy the model: go to the Management perspective, right-click on the B2BTutorial Data Location, and then select Deploy Model Edition:

  1. Reload data by repeating the same operations as in step 2. Enrich and Standardize.

  1. Click on Companies (Golden) in the navigation drawer to open the enriched golden data.
  2. Observe that the list of golden records now contains 54 records instead of 74 previously. This means that the new matching rules using the Normalized and Phonetic Names allowed to match and merge more records.

  1. Sort records by Company and scroll until Freight Ways. Note that the FrateWize record is not visible anymore. Let's check that it was merged with Freight Ways thanks to their matching Phonetic Names:

  1. Click on Companies (Golden) in the navigation drawer to return to the list of golden records. Observe that many records were failing to match because the legal entity type was present in one source but not another:

You will learn how to fix this in the next step.

Congratulations

You have successfully configured match rules.

What we've covered

In the next step, you will improve the matching.

To improve the matching, we will now add a new enricher which removes business entity type, and adjust the sequence of execution for enrichers as follows:

  1. Remove punctuation
  2. Remove business entity type (new)
  3. Phoneticize

This will result in an improved NormalizedName and will get your matching to a very interesting level.

What you'll learn

Add the Remove Business Entity Type enricher

Start by adding the new enricher that removes business entity types:

  1. Return to the Application Builder and open the Design perspective.
  2. Under the Company entity, right-click on Enrichers and select Add SemQL Enricher:

  1. Enter RemoveBusinessEntityType as the Name and click Next.

  1. Define the attribute to enrich: select the NormalizedName attribute, click the Add>> button then Finish.

  1. Click the Edit Expression button next to the NormalizedName attribute in the Enricher Expressions section of the enricher's editor.

  1. Copy and paste the SemQL expression (note that it enriches from NormalizedName back into NormalizedName), and then click OK:
REGEXP_REPLACE (
    NormalizedName,
    '[[:blank:]]+(' /* find spaces preceding the business entity type */
     || 'COMPANY|COMPANY[[:blank:]]*INC|CORPORATION|CORP|AND[[:blank:]]*CO|CO|CO[[:blank:]]*INC|INCORPORATED|INC|LIMITED|LTD|ASSOCIATION|ASSOC|' /* Anglo entities */
     || 'L[[:blank:]]*P|L[[:blank:]]*L[[:blank:]]*P|L[[:blank:]]*L[[:blank:]]*L[[:blank:]]*P|L[[:blank:]]*L[[:blank:]]*C|P[[:blank:]]*L[[:blank:]]*C|' /* More Anglo entities */
     || 'S[[:blank:]]*A[[:blank:]]*R[[:blank:]]*L|S[[:blank:]]*A[[:blank:]]*S|' /* French entities */
     || 'A[[:blank:]]*G|G[[:blank:]]*M[[:blank:]]*B[[:blank:]]*H|'              /* German entities */
     || 'S[[:blank:]]*A|S[[:blank:]]*L|'                                        /* Spanish entities */
     || 'K[[:blank:]]*K|G[[:blank:]]*K|'                                        /* Japanese entities */
     || ')[[:blank:]]*$',
    ''  /* replace the business entity type with an empty string */
)

  1. Save your work.

Change the sequence of enrichers

By default, the new enricher is executed in the last position. We will now change the sequence so that RemoveBusinessEntityType is executed after RemovePunctuation and before PhoneticizeName.

  1. Double-click on Enrichers in the Model Design View.

  1. Select the new enricher RemoveBusinessEntityType and click the Move up button once to move it between RemovePunctuation and PhoneticizeName.

  1. Save your work.

Deploy, reload and test

  1. Validate the model.

  1. Deploy the model.

  1. Return to the B2BTutorial application and repeat the same steps to reload data:

  1. Go to Companies (Golden) to see the resulting golden records:

Display the matching results

You already browsed through the golden data to see which records merged.

Semarchy xDM tracks lots of additional metadata about the matches. You will make some of it available to data stewards:

  1. Return to the Application Builder.
  2. Go to Entities>Company>Collections, and double-click on CompanyCollection to open the Collection editor.

  1. Drag and drop the Confirmation Status attribute from the Attributes list to the list of columns of the CompanyCollection.

  1. Repeat the same operation for the following columns:
  1. Save the collection.
  2. Match rules were not changed, so there is no need to redeploy but only to refresh the application:

  1. Click on Companies (Golden) in the navigation drawer.
  2. Scroll right to see the new columns:

Congratulations

You have successfully refined match rules with the help of a new enricher. Your model is now able to make matches with all of these variations:

What we've covered

The next step will focus on survivorship rules to pick the best golden values and to also allow data stewards to override values when needed.

Semarchy xDM provides a platform for the developer to define survivorship rules that select the best value from the various source records that match and merge.

In addition to these automated rules, the xDM application gives data stewards the ability to override values when the default survivorship rules are not optimal.

You will discover a lot of exciting features in the next steps to make sure that your mastered data is as good as it can possibly be.

What you'll learn

Default survivorship rules

Default rule

In the previous unit Build Your First Data Consolidation Application, you defined the Default Rule for survivorship using a Consolidation Strategy of Preferred Publisher.

Preferred Publisher consists of ordering publishers manually and is perhaps the most common strategy used. It is appropriate for any field when one source system is the system of record for that item.

Semarchy xDM will apply the Default Rule to any attributes that are not explicitly defined to another rule. In this step, you will define explicit survivorship rules for groups of related attributes.

Id survivorship rule

The Id Survivorship Rule defines which source ID is the "winner" in the golden record. This technical rule is generally transparent to your end-users.

It's important in cases like this: a data steward creates a new Contact in xDM and links it to a golden customer. Later over-matching is detected, and the Customer is split into two Customers. Which new Customer should the contact now belong to? The Contact follows the winning source ID.

Configure the address survivorship rule

In the case of addresses, we want to implement the following rule:

You will use a Custom Ranking to create this variation on the standard Preferred Publisher Strategy.

  1. Return to the Application Builder.
  2. Right-click on Survivorship Rules and select Add Survivorship Rule.

  1. Enter Address as the Name and click Finish.

  1. Click Define Survivorship Rule Attributes.

  1. Select and add the following attributes to the Displayed Attributes list by clicking the Add button, and then click Finish:

  1. Scroll down to the Consolidation Rule section.
  2. Keep Custom Ranking as the Consolidation Strategy.
  3. Click the Edit Expression button next to the Ranking Expression field.

  1. Enter the following expression and then click OK:
CASE 
  WHEN Country  = 'US' AND PublisherID  = 'ERP' THEN 1
  WHEN Country  = 'US' AND PublisherID != 'ERP' THEN 2
  WHEN Country != 'US' AND PublisherID  = 'WEB' THEN 1
  WHEN Country != 'US' AND PublisherID != 'WEB' THEN 2
  ELSE 3
END,
CASE 
  WHEN PublisherID = 'FINANCE' THEN 1
  WHEN PublisherID = 'CRM' THEN 2
  ELSE 3
END,
UpdateDate DESC

  1. Select Override - until consolidated value changes in the Override Strategy

  1. Save your work.

Configure the Finance owned survivorship rule

In the context of our B2B application, the Finance team is responsible for these attributes:

Data stewards are allowed to override. But if the Finance team makes a subsequent update, then they should start winning again. That's why we will use Override - until consolidated value changes as Override Strategy.

  1. Add a new survivorship rule as you did in the previous section. Enter FinanceOwned as the Name:

  1. Click Define Survivorship Rule Attributes:

  1. Add the following attributes and then click Finish:

  1. In the Consolidation Rule section:

  1. Add publishers in the following order and then click Finish:

  1. Select Override - until consolidated value changes as the Override Strategy in the Override Rule section.

  1. Save your work.

Deploy and reload

To apply the new survivorship rules, you must deploy the changes and reload data. Repeat the steps that you performed when you configured the match rules:

  1. Validate the model and check that no error is raised.

  1. Deploy the model.

  1. Return to the B2BTutorial application and repeat the same steps to reload the data:

In the next step, you will connect to your application to observe the impact of your new rules.

Congratulations

You have successfully completed the fourth step of this tutorial by defining the survivorship rules.

Now that survivorship rules are defined, your golden records will be correctly calculated from each cluster of master records that get matched together, using data from the most relevant publisher for each attribute.

What we've covered

The next part will reveal all your rules in action in your application.

To finish this tutorial, you will use the Entities view to see the results of your match rules and survivorship rules.

The Entities view is provided by default with all applications to browse, for all entities, the following data:

This view provides lots of details and is very useful at design time. You can customize later your Business Views to display exactly what your business users want to see.

Let's now use the Entities view:

  1. In the B2BTutorial application, click on Entities in the navigation drawer, and then click the Golden Data tile.

  1. Click the Companies tile:

  1. The list of golden records is displayed.

  1. Click on any column header (e.g Company Name) to sort records.

  1. To select and reorder columns, proceed as follows:

  1. Scroll horizontally to see all attributes from the entity:
  1. Identify the Masters Count column. This column indicates how many master records are linked to a golden record. Click twice on its header to sort the table in descending order. Click on the highest value (first row) to open the record.

  1. Select the MASTER RECORDS tab to see all master records this record is linked to. Observe that the values highlighted in blue are the ones that contributed to the golden record, as per the survivorship rules you have configured. For instance, the values of Company Name, Website, Industry in the golden record are coming from the FINANCE publisher which was set as the preferred publisher for them.

  1. Click the Next (>) button to go to the next record.

  1. Observe how this second record was consolidated, following the survivorship rules:

  1. You can browse and review other records by clicking the Previous (<) and Next (>) buttons.

Excellent job following this tutorial! With this unit, you learned to enrich, match and merge data within your first data consolidation application.

What we've covered

What's next?

If you have not done so already, you can follow the Data Authoring track and create your first data authoring application, or the Integration track to learn how to query and load data via using the SQL and REST APIs.