In this tutorial, you will learn how to create robust match rules and enrichers to facilitate these match rules. You will also define some custom survivorship rules. These activities are essential to ensure that data coming from different publishers is matched and optimally consolidated.

Learning outcomes

Learning track

This tutorial is the second unit of the Data Consolidation track. If not done yet, follow the first unit, Create your first data consolidation application.

If you have not completed this prerequisite, return to the Tutorials menu.

GO TO TUTORIALS

Otherwise, enjoy this tutorial!

Laying the groundwork

This tutorial provides guidelines for conducting effective matching on business-to-business (B2B) data. You will grasp the mechanics of crafting match rules tailored to your data requirements.

In a real data management project, you will need to conduct some preparatory work. Consider the following questions:

Semarchy xDM enables the creation of enrichers to normalize, standardize, and enrich data loaded into or authored within your xDM application. This is a key success factor for matching as it allows working with cleaner, richer, and more standardized data.

This section will guide you through the process of creating a SemQL enricher to standardize company names.

Learning outcomes

Add the NormalizedName attribute

Before setting up an enricher, you first need to add an attribute to the Company entity to store standardized company names.

  1. From the Designer perspective of the Application Builder, expand Entities > Company > Attributes.
  2. Right-click on Attributes and select Add Simple Attribute.

  1. Enter NormalizedName in the Name field.
  2. Keep the other default values, and then click Finish.

  1. Repeat the procedure above to create the PhoneticName attribute.
  2. Save the model.

Create an enricher

We will now add a SemQL enricher to perform basic data standardization on CompanyName. The goal is to remove punctuation.

  1. Under the Company entity, right-click on Enrichers, and select Add SemQL Enricher.

  1. Enter RemovePunctuation in the Name field and click Next.

  1. Define the attribute to enrich: select the NormalizedName attribute, click the Add >> button, and then click Finish.

  1. In the Enricher Expressions section, click the Edit Expression button next to the NormalizedName attribute.

  1. Copy and paste the SemQL expression for your database (or build it by double-clicking REGEXP_REPLACE in the Functions list and setting the parameters), and then click OK.
REGEXP_REPLACE( CompanyName, '[[:punct:]]', ' ', 'g' )
REGEXP_REPLACE( CompanyName, '[[:punct:]]', ' ' )

  1. Click Save to save the enricher.

Add the new attributes to the Company collection

  1. Expand the Collections node under the Company entity and double-click on CompanyCollection.

Deploy the model changes

You first need to deploy the changes you just performed on your model.

  1. Go to the Management perspective.
  2. Right-click on B2BTutorial and select Deploy Model Edition.

  1. Click Next.

  1. Click Finish to deploy the listed changes to the data location.

  1. Wait until all changes are deployed.

Reload data

Executing the enricher on all existing data necessitates reloading the data. In this section, you will use the xDM user interface to delete and reload data. If you are loading from a data integration tool, you may use it to truncate and reload data.

  1. Return to the Welcome page and click on B2BTutorial.

  1. Click on Companies (Golden) in the navigation drawer.
  2. Click the Select all checkbox in the collection header to select all records, open the action menu, and then click Delete.

  1. Click on Companies (Master) in the navigation drawer.
  2. Open the action menu of the Companies collection, and then select Import Master Records.

  1. From the tutorial resources, select the
    data-consolidation\datasets\companies-data.xlsx file from the tutorial resources.
  2. Click Continue.

  1. Click Continue on the next screen to confirm the mapping.
  2. Click Finish on the Import Summary page.

  1. Review the imported data and click Finish.

  1. When the toaster indicates "Changes successfully applied," click on Click to refresh to see the data you just loaded.

  1. See that the NormalizedName column is now populated. For instance, the normalized name for FreightWise, Inc. is FreightWise Inc.

  1. Click on Companies (Golden) in the navigation drawer. The normalized name is also available on golden records.

Improve company name standardization

Your enricher overlooks an important special case: the character "&" is treated like any other, but has a meaning that you want to preserve. For example, we aim for "Ben and Jerry's" and "Ben & Jerrys" to be regarded as a strong match and merged accordingly, potentially without human confirmation. This is a very common special case in match rules. Therefore, you will enhance the enricher to substitute "&" with "AND" before eliminating any punctuation.

  1. Click on your avatar picture on the top-right corner of the screen to open the user menu, and select All applications.

  1. Open the Application Builder and select the B2BTutorial [0.0] model.
  2. Expand Entities > Company > Enrichers, and double-click on RemovePunctuation.

  1. Click the Edit expression button next to the enricher's expression.

  1. Replace the existing expression with the following one (depending on your database) and click OK. Observe that this expression will change company names to uppercase, remove punctuation and replace "&" with "AND".
REGEXP_REPLACE( 
        REPLACE( UPPER(CompanyName), '&', ' AND ' ),
        '[[:punct:][:blank:]]+', /* find one or more spaces or punctuation marks */
        ' ',                     /* replace with a single space */
        'g'                      /* replace all occurrences */
)
REGEXP_REPLACE( 
        REPLACE( UPPER(CompanyName), '&', ' AND ' ),
        '[[:punct:][:blank:]]+', /* find one or more spaces or punctuation marks */
        ' '                      /* replace with a single space */
)

  1. Save your work.

You will observe the benefits of this enhancement when you reload data in the next section.

Congratulations!

You have successfully added the first enricher.

To summarize:

In the next section, you will add another enricher for name phonetization.

In this section, you will add another enricher to phoneticize names and help matching records.

Learning outcomes

Add the phonetization enricher

You will now use the Semarchy xDM Text Enricher plug-in to phoneticize company names. This is an extremely useful technique depending on where your source errors come from.

  1. Under the Entities > Company node, right-click on Enrichers and select Add API Enricher.

  1. Enter the following values in the wizard:

  1. Click Finish.
  2. Configure the transformation: scroll down to the Plug-in Params section. Paste the following value in the Transformation parameter.
DOUBLEMETAPHONE 12 TRUE

  1. In the Plug-in Inputs section, click the Edit expression button for the Input Text row.

  1. Double-click on PhoneticName in the Attributes list to add it to the expression and then click OK.

  1. In the Outputs section, click Define Outputs.

  1. Add the PhoneticName attribute and select Transformed Text as the Output Name.

  1. Save your work and validate the model. No error or warning should be reported.

Deploy the model changes

Repeat the same steps to deploy the model changes:

  1. Go to the Management perspective.
  2. Right-click on B2BTutorial and select Deploy Model Edition.

  1. Click Next.

  1. Click Finish to deploy the changes to the data location.

Reload data

You now need to reload data to see the improved name standardization and the name phonetization.

Repeat the same operations as previously:

  1. Return to the Welcome page and open the B2B Tutorial application again.
  2. Click on Companies (Golden) in the navigation drawer.
  3. Select the Select all checkbox in the collection header to select all records.
  4. Open the action menu, and click on Delete.

  1. Select Companies (Master) in the navigation drawer.
  2. Open the actions menu of the Companies collection, and then select Import Master Records.

  1. Select the data-consolidation\datasets\companies-data.xlsx file from the tutorial resources.
  2. Confirm the import options, mapping, and validate the import summary.
  3. Review the imported data and click Finish.

  1. When the toaster indicates "Changes successfully applied," click on Click to refresh.

  1. Click on Companies (Golden) in the navigation drawer to see the enriched golden data. Observe that:

  1. Click on the Company column header to sort results alphabetically.

  1. Browse the records and identify duplicates that could be detected by match rules using the NormalizedName or the PhoneticName attributes:

Congratulations!

You have successfully added a plug-in enricher.

To summarize:

The next section will focus on updating match rules to make the most of these enriched fields.

Semarchy xDM offers a robust feature for defining multiple match rules with varying match rules and merge policies to determine the actions taken on clusters of potential matches as they transform into golden records.

Learning outcomes

Multiple match rules and scores

To enhance the matching process, we aim to implement multiple rules and assign distinct confidence scores to each:

Add match rules

You are now going to add match rules that will leverage the normalized and phonetic names you added previously.

Add a match rule for normalized names

  1. To open the existing SemQL matcher that you initialized during the previous tutorial unit, Create your first data consolidation application, in the Application Builder, expand Entities > Company > Matcher, and then double-click on SemQLMatcher - Company.

  1. Click Add Match Rule on top of the Match Rules list.

  1. Enter the following parameters:

  1. Scroll down to the Matching section:

  1. Paste the following expression in the SemQL editor and click OK.
Record1.NormalizedName = Record2.NormalizedName 
AND ( 
  Record1.PostalCode = Record2.PostalCode
  OR Record1.PostalCode IS NULL OR Record2.PostalCode IS NULL
)
AND Record1.Country = Record2.Country

6. Click Save.

Add a match rule for phonetic names

Add a rule for the PhoneticName attribute by following the same procedure:

  1. Click on SemQLMatcher - Company in the breadcrumb to return to the matcher's configuration.

  1. Click Add Match Rule on top of the Match Rules list.

  1. Enter the following values:
Record1.PhoneticName = Record2.PhoneticName 
AND ( 
  Record1.PostalCode = Record2.PostalCode
  OR Record1.PostalCode IS NULL OR Record2.PostalCode IS NULL
)
AND Record1.Country = Record2.Country

  1. Save the rule.
  2. Click again on SemQLMatcher - Company in the breadcrumb to return to the matcher's configuration. Review your matcher, which is now composed of three rules:

Configure thresholds

Now that new match rules have been defined to identify potential matches, we do not want all matches to merge automatically anymore.

Modify the thresholds for the Company SemQLMatcher accordingly:

  1. Within the SemQLMatcher editor, scroll down to the Merge Policy section.
  2. Enter the following thresholds to define when records will be merged automatically:

  1. Within the Auto-Confirm Policy section, set the Auto-confirm golden records property to 91. This means that only golden records with a confidence score of 91 or higher will be confirmed and merged automatically.

Deploy, reload and test

  1. Save your work and validate the model.

  1. Deploy the model: go to the Management perspective, right-click on the B2BTutorial Data Location, and then select Deploy Model Edition:

  1. Reload data by repeating the same operations as in ⓵ Standardize company names with an enricher.

  1. Click on Companies (Golden) in the navigation drawer to open the enriched golden data.
  2. Observe that the list of golden records now includes 54 records instead of 74 previously. This means that the new matching rules using the normalized and phonetic names allowed to match and merge more records.

  1. Sort records by Company and scroll until Freight Ways.
  2. Note that the FrateWize record is not visible anymore. Verify that it was merged with Freight Ways based on their matching phonetic names:

  1. Click on Companies (Golden) in the navigation drawer to return to the list of golden records. Observe that many records have failed to match because the legal entity type was present in one source, but not in another:

You will learn how to fix this in the next section.

Congratulations!

You have successfully configured match rules.

To summarize:

Next, you will improve the matching process.

To improve matching, we will now add a new enricher which removes business entity type, and adjust the sequence of execution for enrichers as follows:

  1. Remove punctuation.
  2. Remove business entity type (new).
  3. Phoneticize.

This will result in an improved normalized name and will get your matching to a very interesting level.

Learning outcomes

Add the RemoveBusinessEntityType enricher

Start by adding the new enricher that removes business entity types:

  1. Return to the Application Builder and open the Design perspective.
  2. Under the Company entity, right-click on Enrichers and select Add SemQL Enricher.

  1. Enter RemoveBusinessEntityType in the Name field and click Next.

  1. Define the attribute to enrich: select the NormalizedName attribute, click the Add >> button, then click Finish.

  1. Click the Edit Expression button next to the NormalizedName attribute in the Enricher Expressions section of the enricher's editor.

  1. Copy and paste the SemQL expression (note that it enriches NormalizedName back to NormalizedName), and then click OK.
REGEXP_REPLACE (
    NormalizedName,
    '[[:blank:]]+(' /* find spaces preceding the business entity type */
     || 'COMPANY|COMPANY[[:blank:]]*INC|CORPORATION|CORP|AND[[:blank:]]*CO|CO|CO[[:blank:]]*INC|INCORPORATED|INC|LIMITED|LTD|ASSOCIATION|ASSOC|' /* Anglo entities */
     || 'L[[:blank:]]*P|L[[:blank:]]*L[[:blank:]]*P|L[[:blank:]]*L[[:blank:]]*L[[:blank:]]*P|L[[:blank:]]*L[[:blank:]]*C|P[[:blank:]]*L[[:blank:]]*C|' /* More Anglo entities */
     || 'S[[:blank:]]*A[[:blank:]]*R[[:blank:]]*L|S[[:blank:]]*A[[:blank:]]*S|' /* French entities */
     || 'A[[:blank:]]*G|G[[:blank:]]*M[[:blank:]]*B[[:blank:]]*H|'              /* German entities */
     || 'S[[:blank:]]*A|S[[:blank:]]*L|'                                        /* Spanish entities */
     || 'K[[:blank:]]*K|G[[:blank:]]*K|'                                        /* Japanese entities */
     || ')[[:blank:]]*$',
    ''  /* replace the business entity type with an empty string */
)

  1. Save your work.

Change the enricher sequence

By default, the latest enricher is executed last. We will now change the sequence so that RemoveBusinessEntityType is executed after RemovePunctuation and before PhoneticizeName.

  1. Double-click on Enrichers in the Model Design view.

  1. Select the new enricher RemoveBusinessEntityType and click the Move up button once to move it between RemovePunctuation and PhoneticizeName.

  1. Save your work.

Deploy, reload, and test

  1. Validate the model.

  1. Deploy the model.

  1. Return to the B2B Tutorial application and repeat the same steps to reload data:

  1. Go to Companies (Golden) to see the resulting golden records:

Display the matching results

You already reviewed the merged records in the golden data. Semarchy xDM keeps track of various additional metadata regarding the matches, which you will now make accessible to data stewards.

  1. Return to the Application Builder.
  2. Go to Entities > Company > Collections, and double-click on CompanyCollection to open the Collection editor.

  1. Drag and drop the Confirmation Status attribute from the Attributes list to the list of columns of the CompanyCollection.

  1. Repeat the same operation for the following columns:
  1. Save the collection.
  2. Match rules were not changed, so there is no need to redeploy. SImply refresh the application:

  1. Click on Companies (Golden) in the navigation drawer.
  2. Scroll right to see the new columns:

Congratulations!

You have successfully refined match rules with the help of a new enricher. Your model is now able to make matches with all of these variations:

To summarize:

The next section will focus on survivorship rules to select the optimal golden-record values and enable data stewards to override values when needed.

Semarchy xDM offers developers a platform to define survivorship rules, which determine the best value from multiple source records that match and merge. Alongside these automated rules, the xDM application empowers data stewards to override values when the default survivorship rules are inadequate. In the upcoming sections, you will explore numerous interesting features to ensure that your mastered data reaches its highest quality potential.

Learning outcomes

Default survivorship rules

Default rule

In the previous unit, Build your first data consolidation application, you established the default survivorship rule using a preferred publisher consolidation strategy. This strategy involves manually prioritizing publishers and is widely used, particularly for fields where one source system serves as the system of record for that item. Semarchy xDM automatically applies the default rule to any attributes that are not explicitly assigned to another rule. Now, in this section, you will define explicit survivorship rules for groups of correlated attributes.

ID survivorship rule

The ID survivorship rule determines the winning source ID in the golden record. Typically, this technical rule is transparent to end-users. Its significance becomes apparent in scenarios such as when a data steward links a new Contact to a Customer golden record in xDM. Later, if over-matching occurs and the Customer record is split into two records, the Contact data remains associated with the winning source ID.

Configure the address survivorship rule

For addresses, we want to implement the following rule:

To implement this modification of the standard preferred publisher strategy, you will employ a custom ranking approach.

  1. Return to the Application Builder.
  2. Right-click on Survivorship Rules and select Add Survivorship Rule.

  1. Enter Address in the Name field and click Finish.

  1. Click Define Survivorship Rule Attributes.

  1. Select and add the following attributes to the Displayed Attributes list by clicking the Add button:
  1. Click Finish.

  1. Scroll down to the Consolidation Rule section.
  2. Keep Custom Ranking as the consolidation strategy.
  3. Click the Edit Expression button next to the Ranking Expression field.

  1. Enter the following expression and then click OK.
CASE 
  WHEN Country  = 'US' AND PublisherID  = 'ERP' THEN 1
  WHEN Country  = 'US' AND PublisherID != 'ERP' THEN 2
  WHEN Country != 'US' AND PublisherID  = 'WEB' THEN 1
  WHEN Country != 'US' AND PublisherID != 'WEB' THEN 2
  ELSE 3
END,
CASE 
  WHEN PublisherID = 'FINANCE' THEN 1
  WHEN PublisherID = 'CRM' THEN 2
  ELSE 3
END,
UpdateDate DESC

  1. Select Override - until consolidated value changes in the Override Strategy.

  1. Save your work.

Configure the Finance-owned survivorship rule

In the context of our B2B Tutorial application, the Finance team is responsible for these attributes:

Data stewards are allowed to override them. However, if the Finance team provides an updated value, their input should take precedence once again. Hence, we will employ the Override - until consolidated value changes strategy.

  1. Add a new survivorship rule as you did in the previous section.
  2. Enter FinanceOwned in the Name field.

  1. Click Define Survivorship Rule Attributes.

  1. Add the following attributes:
  1. Click Finish.

  1. In the Consolidation Rule section:

  1. Add publishers in the following order and then click Finish:

  1. Select Override - until consolidated value changes as the override strategy in the Override Rule section.

  1. Save your work.

Deploy and reload

To apply the new survivorship rules, you must deploy the changes and reload data. Repeat the steps that you performed when you configured the match rules:

  1. Validate the model and check that no error is raised.

  1. Deploy the model.

  1. Return to the B2B Tutorial application and repeat the same steps to reload the data:

In the next section, you will connect to your application to observe the impact of your new rules.

Congratulations!

You have successfully defined survivorship rules. Now, your golden records will be correctly calculated from each cluster of master records that are matched together, using data from the most relevant publisher for each attribute.

To summarize:

The next section will demonstrate how all your rules function within your application.

You will now use the Entities view to observe the results of your match and survivorship rules.

The Entities view is provided by default with all applications to browse, for all entities, the following data:

This view provides extensive details and proves highly beneficial during the design phase. Later on, you can customize your business views to showcase precisely what your business users require.

To use the Entities view:

  1. In the B2B Tutorial application, click on Entities in the navigation drawer, and then click the Golden Data tile.

  1. Click the Companies tile.

  1. The list of golden records is displayed.

  1. Click on any column header (e.g., Company Name) to sort records.

  1. To select and reorder columns, proceed as follows:

  1. Scroll horizontally to see all attributes from the entity:
  1. Locate the Masters Count column. This column indicates how many master records are linked to a golden record.
  2. Click twice on its header to sort the table in descending order.
  3. Click on the highest value (first row) to open the record.

  1. Select the Master records tab to see all master records this record is linked to. Observe that the values highlighted in blue are the ones that contributed to the golden record, as per the survivorship rules you have configured. For instance, the values of Company Name, Website, Industry in the golden record are coming from the FINANCE publisher which was set as the preferred publisher for them.

  1. Click the Next (>) button to go to the next record.

  1. Observe how this second record was consolidated, according to the survivorship rules:

  1. You can browse and review other records by clicking the Previous (<) and Next (>) buttons.

Well done! In this unit, you have learned to enrich, match, and merge data within your first data consolidation application.

Learning recap

Next steps

If you have not done so already, you can follow the Data Authoring track and create your first data authoring application, or the Data Publishing & Consumption track to learn how to query and load data using the SQL and REST APIs.

GO TO TUTORIALS

Thank you for completing this tutorial.