In this tutorial, you will learn how to create robust match rules and how to create enrichers to facilitate these match rules. You will also define some custom survivorship rules. These activities are essential to ensure that data coming from the different publishers is matched and consolidated optimally.
What you'll learn
Designing enrichers to standardize and augment your data
The estimated duration of this unit is about 1 hour and 40 minutes.
Do your homework!
In this tutorial, we give you the rules to perform some useful matching on Business-to-Business (B2B) data.
You will learn all the mechanics to write the match rules for your data requirements.
In a real Data Management project, you will need to do some homework! Consider the following:
How much over-matching is acceptable?
How much under-matching is acceptable?
Where is your data most reliable?
Where is your data least reliable?
What is the source of errors in source data?
With that in mind, let's get started.
Semarchy xDM allows you to create enrichers to normalize, standardize, and enrich data loaded into or authored within your xDM application. This is a key success factor for matching as it allows matching on cleaner, richer, and more standardized data.
In this first step, you will be guided to create a SemQL enricher to standardize company names.
What you'll learn
Creating a SemQL enricher
Reloading data to see the impact of your enricher on matching
Add the Normalized Name attribute
Before setting up an enricher, you first need to add an attribute to the Company entity to store standardized company names.
From the Designer perspective of the Application Builder, expand Entities>Company>Attributes.
Right-click on Attributes and select Add Simple Attribute.
Enter NormalizedName as the Name, keep the other default values, and then click Finish.
Save the model.
Create an enricher
Let's now add a SemQL enricher to perform basic data standardization on CompanyName. The goal is to remove punctuation.
Under the Company entity, right-click on Enrichers and select Add SemQL Enricher:
Enter RemovePunctuation as the Name and click Next.
Define the attribute to enrich: select the NormalizedName attribute, click the Add>> button, then Finish.
In the Enricher Expressions section, click the Edit Expression button next to the NormalizedName attribute.
Copy and paste the SemQL expression for your database (or build it by double-clicking the REGEXP_REPLACE function in the Functions list and setting the parameters), and then click OK.
Expand the Collections node under the Company entity and double-click on CompanyCollection:
Drag and drop Normalized Name and Phonetic Name from the Attributes list to the content of the Collection:
Save the Collection.
Deploy the model changes
You first need to deploy the changes you just performed on your model.
Go to the Management perspective.
Right-click on B2BTutorial and select Deploy Model Edition.
Click Next.
Click Finish to deploy the listed changes to the Data Location:
Wait until all changes are deployed.
Reload data
Reloading data is required to execute the enricher on all existing data. In this section, you will delete and load data again using the xDM user interface. If you're loading from a data integration tool, you may use it to truncate and reload data.
Return to the Welcome Page and click on B2BTutorial.
Click on Companies (Golden) in the Navigation Drawer.
Click the "Select all" checkbox in the collection header to select all records, open the action menu, and then click Delete:
Click on Companies (Master) in the Navigation Drawer, open the action menu of the Companies collection, and then select Import Master Records.
From the tutorial resources, select the data-consolidation\datasets\companies-data.xlsx file.file from the tutorial resources.
Click CONTINUE:
Click CONTINUE on the next screen to confirm the mapping.
Click FINISH on the Summary page:
Review the imported data and click FINISH.
When the toaster indicates "Changes successfully applied", click on CLICK TO REFRESH to see the data you just loaded:
See that the Normalized Name column is now populated: for instance, the Normalized Name for "FreightWise, Inc." is "FreightWise Inc".
Click on Companies (Golden) in the Navigation Drawer. The Normalized Name is also available on golden records..
Improve company name standardization
Your enricher misses one important special case: the character "&" is treated like any other, but has a meaning you want to preserve.
For instance, we would like "Ben and Jerry's" and "Ben & Jerrys" to be treated as a strong match and to be merged accordingly, possibly without human confirmation. This is a very common special case in match rules.
That's why you will update the enricher to replace "&" with "AND" before removing any punctuation.
Click on your avatar picture on the top-right corner of the screen to open the user menu, and select All applications.
Open the Application Builder and select the B2BTutorial [0.0] model.
Expand Entities>Company>Enrichers, and double-click on RemovePunctuation.
Click the Edit expression button next to the enricher's expression:
Replace the existing expression with the following one (depending on your database) and click OK. Observe that this expression will change company names to uppercase, remove punctuation and replace "&" with "AND".
On PostgreSQL, use the following expression:
REGEXP_REPLACE(
REPLACE( UPPER(CompanyName), '&', ' AND ' ),
'[[:punct:][:blank:]]+', /* find one or more spaces or punctuation marks */
' ', /* replace with a single space */
'g' /* replace all occurrences */
)
On Oracle, use the following expression:
REGEXP_REPLACE(
REPLACE( UPPER(CompanyName), '&', ' AND ' ),
'[[:punct:][:blank:]]+', /* find one or more spaces or punctuation marks */
' ' /* replace with a single space */
)
Save your work.
You will observe the benefits of this enhancement when you reload data in the next step.
Congratulations
You have successfully completed the first step of this tutorial by adding a first enricher.
What we've covered
You created a SemQL enricher to standardize company names.
You deleted and reloaded data to execute your enricher in action.
You improved your enricher by removing punctuation.
In the next step, you will add another enricher for name phonetization.
In this step, you will add another enricher to phoneticize names and help matching records.
What you'll learn
Using a Plug-in enricher to generate a phonetic version of a company name
Reloading data to see the impact of your enricher on matching
Add the PhoneticName attribute
Let's start by adding an attribute to store the phonetized name. Proceed the same way you did for NormalizedName.
Right-click on the Attributes node under the Company entity and select Add Simple Attribute.
Enter PhoneticName as the Name, keep the other default parameters, and then click Finish.
Save the model.
Add the phonetization enricher
You will now use the Semarchy xDM Text Enricher Plug-in to phoneticize company names. This is an extremely useful technique depending on where your source errors come from.
Under the Entities>Company node, right-click on Enrichers and select Add API Enricher.
Enter the following values in the wizard and then click Finish:
Select the Java Plug-in option.
Select the Semarchy Text Enricher in the drop-down list.
Enter PhoneticizeName as the Name of your enricher.
Configure the transformation: scroll down to the Plug-in Params section. Paste the following value in the Transformation parameter:
DOUBLEMETAPHONE 12 TRUE
In the Plug-in Inputs section, click the Edit expression button for the Input Text row.
Double-click on Phonetic Name in the Attributes list to add it to the expression and then click OK.
In the Outputs section, click Define Outputs:
Add the PhoneticName attribute and select Transformed Text as the Output Name:
Save your work and validate the model. No error or warning should be reported.
Deploy the model changes
Repeat the same steps to deploy the model changes:
Go to the Management perspective.
Right-click on B2BTutorial and select Deploy Model Edition.
Click Next.
Click Finish to deploy the changes to the Data Location.
Reload data
You now need to reload data to see the improved name standardization and the name phonetization. Repeat the same operations as previously:
Return to the welcome page and access the B2BTutorial application again.
Click on Companies (Golden) in the Navigation Drawer.
Select the "Select all" checkbox in the collection header to select all records, open the action menu, and click on Delete:
Select Companies (Master) in the Navigation Drawer, open the actions menu of the Companies collection, and then select Import Master Records.
Select the data-consolidation\datasets\companies-data.xlsx file from the tutorial resources.
Confirm the import options, mapping, and validate the import summary.
Review the imported data and click FINISH.
When the toaster indicates "Changes successfully applied", click on CLICK TO REFRESH:
Click on Companies (Golden) in the Navigation Drawer to see the enriched golden data. Observe that:
The Normalized Name column now contains uppercase letters only.
The value of Normalized Name for "J.P. Morgan Chase & Co." is "J P MORGAN CHASE AND CO".
The Phonetic Name column is now populated.
Click on the Company column header to sort results alphabetically:
Browse the records and identify duplicates that could be detected by match rules using the Normalized Name or the Phonetic Name attribute:
Hewlett Packard and Hewlett-Packard have slightly different names but the same Normalized Name or the Phonetic Name:
FrateWize and Freight Ways have very different values for Normalized Name, but the same Phonetic Name:
Congratulations
You have successfully completed the second step of this tutorial by adding a Plug-in enricher.
What we've covered
You created a Plug-in enricher to phoneticize company names.
You deleted and reloaded data.
The next step will focus on updating match rules to take advantage of these enriched fields.
Semarchy xDM provides a powerful mechanism to define multiple match rules with different match scores, and merge policies to define what happens to clusters of potential matches as they become golden records.
Let's see how it works.
What you'll learn
Improving matching by defining multiple match rules
Adding a new enricher to improve the matching
Displaying matching metadata in your application
Multiple match rules and scores
To improve the matching, we want to apply several rules and associate a different confidence score to each of them:
Perfect match: for our data, an exact match on the name and the address is a perfect match. The related match score is 100.
Normalized name: our different source systems use punctuation and business entity types ('Inc' and 'Corp') differently. If we account for that, then matching by name is quite confident and we can use a match score of 94.
Phonetic name: someone entered a name provided by someone else. As long as the postal code also matches, we're reasonably confident in these phonetic matches and will use a match score of 88.
Add match rules
Let's now add match rules that will leverage the normalized and phonetic names you added previously.
Add a match rule for normalized name
Open the existing SemQL matcher that you initialized during the previous tutorial unit Create Your First Data Consolidation Application: In the Application Builder, expand Entities>Company>Matcher, and then double-click on SemQLMatcher - Company.
Click Add Match Rule on top of the Match Rules list:
Enter the following parameters:
Name: NormalizedName
Documentation: Match on NormalizedName within a country where PostalCode does not conflict
Scroll down to the Matching section.
Set Match Score to 94 for your rule.
Click the Edit Expression button next to the Match Condition field.
Paste the following expression in the SemQL editor and click OK:
Record1.NormalizedName = Record2.NormalizedName
AND (
Record1.PostalCode = Record2.PostalCode
OR Record1.PostalCode IS NULL OR Record2.PostalCode IS NULL
)
AND Record1.Country = Record2.Country
6. Click Save.
Add a match rule for phonetic name
Add a rule for the PhoneticName attribute by following the same procedure:
Click on SemQLMatcher - Company in the breadcrumb to return to the matcher's configuration:
Click Add Match Rule on top of the Match Rules list.
Enter the following values:
Name: PhoneticName
Documentation: Match on PhoneticName within a country where PostalCode does not conflict
Match Condition: paste the following expression:
Record1.PhoneticName = Record2.PhoneticName
AND (
Record1.PostalCode = Record2.PostalCode
OR Record1.PostalCode IS NULL OR Record2.PostalCode IS NULL
)
AND Record1.Country = Record2.Country
Match Score: 88
Save the rule.
Click again on SemQLMatcher - Company in the breadcrumb to return to the matcher's configuration. Review your matcher, now composed of 3 rules:
PerfectMatch (score: 100)
NormalizedName (score: 94)
PhoneticName (score: 88)
Configure thresholds
Now that new match rules have been defined to identify potential matches, we don't want all matches to merge automatically anymore. Let's modify the thresholds for the Company SemQLMatcher accordingly:
Within the SemQLMatcher editor, scroll down to the Merge Policy section.
Enter the following thresholds to define when records will be merged automatically:
Create a golden record from new master records: 85
Merge unconfirmed golden records: 85
Merge confirmed golden records: 85
Merge unconfirmed with confirmed golden records: 85
Add new master records to an unconfirmed golden record: 85
Add new master records to a confirmed golden record: 85
Within the Auto-Confirm Policy section, set the Auto-confirm golden records property to 91. This means that only golden records with a confidence score of 91 or higher will be confirmed and merged automatically.
Deploy, reload and test
Save your work and validate the model.
Deploy the model: go to the Management perspective, right-click on the B2BTutorial Data Location, and then select Deploy Model Edition:
Reload data by repeating the same operations as in step 2. Enrich and Standardize.
Return to the B2BTutorial application.
Click on Companies (Golden) in the Navigation Drawer.
Select the "Select all" checkbox in the collection header to select all records, open the actions menu, and then click on Delete:
Click on Companies (Master) in the Navigation Drawer, then click the action menu of the Companies collection, and then select Import Master Records.
Select the data-consolidation\datasets\companies-data.xlsx file from the tutorial resources.
Confirm the import options, mapping, and then validate the import summary.
Review the imported data and then click FINISH.
When the toaster indicates "Changes successfully applied", click on CLICK TO REFRESH:
Click on Companies (Golden) in the Navigation Drawer to open the enriched golden data.
Observe that the list of golden records now contains 54 records instead of 74 previously. This means that the new matching rules using the Normalized and Phonetic Names allowed to match and merge more records.
Sort records by Company and scroll until Freight Ways. Note that the FrateWize record is not visible anymore. Let's check that it was merged with Freight Ways thanks to their matching Phonetic Names:
Click on the Freight Ways record.
On the record's browsing form, click the Options button (vertical ellipsis) and select Explain Record in the menu.
The golden record's explanation confirms that it was consolidated from the Freight Ways and Frate Wize master records:
Move your mouse over the golden record tile to see information about that record. You can see that this record has a Score of 88 and it is still to confirm (suggested match). Move your mouse over the blue line between the master records to see the rule that matched them (PhoneticName).
Click on Companies (Golden) in the Navigation Drawer to return to the list of golden records. Observe that many records were failing to match because the legal entity type was present in one source but not another:
Apple vs Apple Inc.
AT&T vs AT&T Inc.
CVS Caremark vs CVS Caremark, L.L.C vs CVS Caremark,, LLC
...
You will learn how to fix this in the next step.
Congratulations
You have successfully configured match rules.
What we've covered
Adding match rules
Configuring thresholds
Reloading data to see the impact of your match rules
In the next step, you will improve the matching.
To improve the matching, we will now add a new enricher which removes business entity type, and adjust the sequence of execution for enrichers as follows:
Remove punctuation
Remove business entity type (new)
Phoneticize
This will result in an improved NormalizedName and will get your matching to a very interesting level.
What you'll learn
Improving name standardization
Changing the sequence of execution for your enrichers
Add the Remove Business Entity Type enricher
Start by adding the new enricher that removes business entity types:
Return to the Application Builder and open the Design perspective.
Under the Company entity, right-click on Enrichers and select Add SemQL Enricher:
Enter RemoveBusinessEntityType as the Name and click Next.
Define the attribute to enrich: select the NormalizedName attribute, click the Add>> button then Finish.
Click the Edit Expression button next to the NormalizedName attribute in the Enricher Expressions section of the enricher's editor.
Copy and paste the SemQL expression (note that it enriches from NormalizedName back into NormalizedName), and then click OK:
REGEXP_REPLACE (
NormalizedName,
'[[:blank:]]+(' /* find spaces preceding the business entity type */
|| 'COMPANY|COMPANY[[:blank:]]*INC|CORPORATION|CORP|AND[[:blank:]]*CO|CO|CO[[:blank:]]*INC|INCORPORATED|INC|LIMITED|LTD|ASSOCIATION|ASSOC|' /* Anglo entities */
|| 'L[[:blank:]]*P|L[[:blank:]]*L[[:blank:]]*P|L[[:blank:]]*L[[:blank:]]*L[[:blank:]]*P|L[[:blank:]]*L[[:blank:]]*C|P[[:blank:]]*L[[:blank:]]*C|' /* More Anglo entities */
|| 'S[[:blank:]]*A[[:blank:]]*R[[:blank:]]*L|S[[:blank:]]*A[[:blank:]]*S|' /* French entities */
|| 'A[[:blank:]]*G|G[[:blank:]]*M[[:blank:]]*B[[:blank:]]*H|' /* German entities */
|| 'S[[:blank:]]*A|S[[:blank:]]*L|' /* Spanish entities */
|| 'K[[:blank:]]*K|G[[:blank:]]*K|' /* Japanese entities */
|| ')[[:blank:]]*$',
'' /* replace the business entity type with an empty string */
)
Save your work.
Change the sequence of enrichers
By default, the new enricher is executed in the last position. We will now change the sequence so that RemoveBusinessEntityType is executed after RemovePunctuation and before PhoneticizeName.
Double-click on Enrichers in the Model Design View.
Select the new enricher RemoveBusinessEntityType and click the Move up button once to move it between RemovePunctuation and PhoneticizeName.
Save your work.
Deploy, reload and test
Validate the model.
Deploy the model.
Return to the B2BTutorial application and repeat the same steps to reload data:
Delete all data from the Companies (Golden) collection.
Go to Companies (Master) and import the data-consolidation\datasets\companies-data.xlsx file from the tutorial resources again.
Go to Companies (Golden) to see the resulting golden records:
The list is now composed of 40 golden records (compared to 54 previously), which means that more Masters could be matched and merged.
Observe that there is only one golden record for Apple, AT&T, CVS Caremark (against 2 previously because of the different entity types).
Open the CVS Caremark,, LLC record and select Explain Record in the actions menu. Observe that the new enricher allowed to match CVS Cavemark, LLC with CVS Cavemark.
Display the matching results
You already browsed through the golden data to see which records merged.
Semarchy xDM tracks lots of additional metadata about the matches. You will make some of it available to data stewards:
Return to the Application Builder.
Go to Entities>Company>Collections, and double-click on CompanyCollection to open the Collection editor.
Drag and drop the Confirmation Status attribute from the Attributes list to the list of columns of the CompanyCollection.
Repeat the same operation for the following columns:
Confidence Score
Masters Count
Has Suggested Merge
Save the collection.
Match rules were not changed, so there is no need to redeploy but only to refresh the application:
Return to the B2BTutorial application.
Open the user menu on the upper right corner and select Refresh application.
Click on Companies (Golden) in the Navigation Drawer.
Scroll right to see the new columns:
Confirmation Status: indicates whether the golden record has its masters confirmed (partially or entirely) or not.
Confidence Score: the average of the match scores in the match group.
Masters Count: the number of master records contributing to the golden record.
Has Sugg. Merge: indicates whether match and merge suggestions are available for this record.
Congratulations
You have successfully refined match rules with the help of a new enricher. Your model is now able to make matches with all of these variations:
What we've covered
You created an additional enricher to improve the matching
You optimized the sequence of execution for your enrichers
You displayed some of the matching metadata
The next step will focus on survivorship rules to pick the best golden values and to also allow data stewards to override values when needed.
Semarchy xDM provides a platform for the developer to define survivorship rules that select the best value from the various source records that match and merge.
In addition to these automated rules, the xDM application gives data stewards the ability to override values when the default survivorship rules are not optimal.
You will discover a lot of exciting features in the next steps to make sure that your mastered data is as good as it can possibly be.
Preferred Publisher consists of ordering publishers manually and is perhaps the most common strategy used. It is appropriate for any field when one source system is the system of record for that item.
Semarchy xDM will apply the Default Rule to any attributes that are not explicitly defined to another rule. In this step, you will define explicit survivorship rules for groups of related attributes.
Id survivorship rule
The Id Survivorship Rule defines which source ID is the "winner" in the golden record. This technical rule is generally transparent to your end-users.
It's important in cases like this: a data steward creates a new Contact in xDM and links it to a golden customer. Later over-matching is detected, and the Customer is split into two Customers. Which new Customer should the contact now belong to? The Contact follows the winning source ID.
Configure the address survivorship rule
In the case of addresses, we want to implement the following rule:
In the US, we trust the addresses from the ERP most.
Outside the US, we trust addresses that come from the WEB publisher. If we don't have one of these, then we should trust FINANCE followed by CRM.
If we have two records from the most trusted source, we should use the latest record.
You will use a Custom Ranking to create this variation on the standard Preferred Publisher Strategy.
Return to the Application Builder.
Right-click on Survivorship Rules and select Add Survivorship Rule.
Enter Address as the Name and click Finish.
Click Define Survivorship Rule Attributes.
Select and add the following attributes to the Displayed Attributes list by clicking the Add button, and then click Finish:
Address1
City
State
PostalCode
Country
Scroll down to the Consolidation Rule section.
Keep Custom Ranking as the Consolidation Strategy.
Click the Edit Expression button next to the Ranking Expression field.
Enter the following expression and then click OK:
CASE
WHEN Country = 'US' AND PublisherID = 'ERP' THEN 1
WHEN Country = 'US' AND PublisherID != 'ERP' THEN 2
WHEN Country != 'US' AND PublisherID = 'WEB' THEN 1
WHEN Country != 'US' AND PublisherID != 'WEB' THEN 2
ELSE 3
END,
CASE
WHEN PublisherID = 'FINANCE' THEN 1
WHEN PublisherID = 'CRM' THEN 2
ELSE 3
END,
UpdateDate DESC
Select Override - until consolidated value changes in the Override Strategy
Save your work.
Configure the Finance owned survivorship rule
In the context of our B2B application, the Finance team is responsible for these attributes:
CompanyName
Region
Website
Industry
Data stewards are allowed to override. But if the Finance team makes a subsequent update, then they should start winning again. That's why we will use Override - until consolidated value changes as Override Strategy.
Add a new survivorship rule as you did in the previous section. Enter FinanceOwned as the Name:
Click Define Survivorship Rule Attributes:
Add the following attributes and then click Finish:
CompanyName
Region
Website
Industry
In the Consolidation Rule section:
Select Preferred Publisher as the Consolidation Strategy.
Click the Select Publishers button next to the Preferred Publishers property.
Add publishers in the following order and then click Finish:
FINANCE
WEB
ERP
CRM
Select Override - until consolidated value changes as the Override Strategy in the Override Rule section.
Save your work.
Deploy and reload
To apply the new survivorship rules, you must deploy the changes and reload data. Repeat the steps that you performed when you configured the match rules:
Validate the model and check that no error is raised.
Deploy the model.
Return to the B2BTutorial application and repeat the same steps to reload the data:
Delete all data from the Companies (Golden) collection.
Go to Companies (Master) and import the data-consolidation\datasets\companies-data.xlsx file from the tutorial resources.
In the next step, you will connect to your application to observe the impact of your new rules.
Congratulations
You have successfully completed the fourth step of this tutorial by defining the survivorship rules.
Now that survivorship rules are defined, your golden records will be correctly calculated from each cluster of master records that get matched together, using data from the most relevant publisher for each attribute.
What we've covered
Creating a Preferred Publisher rule
Creating a Custom Ranking rule
Defining the Override Strategy
The next part will reveal all your rules in action in your application.
To finish this tutorial, you will use the Entities view to see the results of your match rules and survivorship rules.
The Entities view is provided by default with all applications to browse, for all entities, the following data:
Golden data
Master data
Source data (loaded from publishers)
Source authoring (authored by users)
Records (golden, source) with errors
This view provides lots of details and is very useful at design time. You can customize later your Business Views to display exactly what your business users want to see.
Let's now use the Entities view:
In the B2BTutorial application, click on Entities in the Navigation Drawer, and then click the Golden Data tile.
Click the Companies tile:
The list of golden records is displayed.
Click on any column header (e.g Company Name) to sort records.
To select and reorder columns, proceed as follows:
Open the actions menu of the list:
Click Select columns:
Select or unselect columns to show or hide them;drag and drop them to change their order. Click APPLY to go back and refresh the list.
Scroll horizontally to see all attributes from the entity:
Business data: Company Name, Address1...
Attributes populated by enrichers: Normalized Name/Phonetic Name
Technical attributes: creation/last update date and user, Batch ID
Match and merge metadata: Confidence Score, Masters Count...
Identify the Masters Count column. This column indicates how many master records are linked to a golden record. Click twice on its header to sort the table in descending order. Click on the highest value (first row) to open the record.
Select the MASTER RECORDS tab to see all master records this record is linked to. Observe that the values highlighted in blue are the ones that contributed to the golden record, as per the survivorship rules you have configured. For instance, the values of Company Name, Website, Industry in the golden record are coming from the FINANCE publisher which was set as the preferred publisher for them.
Click the Next (>) button to go to the next record.
Observe how this second record was consolidated, following the survivorship rules:
Address attributes were consolidated according to the Address survivorship rule.
CompanyName, Website, Industry were consolidated as per the Finance owned rule.
You can browse and review other records by clicking the Previous (<) and Next (>) buttons.
Excellent job following this tutorial! With this unit, you learned to enrich, match and merge data within your first data consolidation application.
What we've covered
You learned to create match rules.
You learned to create enrichers to facilitate these match rules.
You learned to define survivorship rules and select the best value from the various source records which get matched together.
What's next?
If you have not done so already, you can follow the Data Authoring track and create your first data authoring application, or the Integration track to learn how to query and load data via using the SQL and REST APIs.