Matching records

The matching process involves comparing pairs of records using multiple match rules to determine if they are duplicates.

The actual matching phase may be preceded by a binning phase.

Matching logic elements: Record1 and Record2

Match rules operate on two pseudo-records: Record1 and Record2. In the context of the certification process, Record1 is an alias for any master record either created or updated by the current data load, while Record2 is an alias for any master record, including those created or updated by the current load.

Understanding the difference between Record1 and Record2 helps in designing match rules that either maintain consistency or require additional safeguards to ensure reliable data matching and consolidation. For more information, see Advanced match rules.

Binning

Binning groups records into smaller, more manageable sets, or bins. This approach reduces the number of comparisons needed by narrowing down the search scope.

It is achieved through multiple SemQL expressions specified in a match rule. Records for which all binning expressions produce identical results are placed in the same bin.

While binning is a valuable tool that makes the matching logic conceptually simpler, it does not necessarily enhance performance. Its effectiveness relies on using selective binning attributes and having custom indexes on the related columns in the database.
As a matter of fact, direct attribute comparisons in the match rule often provide better performance. For instance:

Record2.Attr1 IS NOT NULL AND Record2.Attr1 = Record1.Attr1

Effective use of binning

Instead of matching all records globally, binning allows for:

Matching records based on a specific attribute (e.g., matching customers only when they are located in the same country).
Further narrowing the binning results by combining attributes or specific functions (e.g., matching customers within the same country and sales region).

Example. Binning by country and sales region

To bin customers according to their country and sales region in the GeocodedAddress complex field, you could use the following expressions:

Binning expression #1: GeocodedAddress.Country
Binning expression #2: GeocodedAddress.Region

Smaller bins lead to faster processing, but designers should make sure that binning does not exclude potential matches.

Example. Incorrect binning

Using the first four letters of a last name for binning can separate similar records into different bins, leading to missed matches. In such a case, Jones-Smith, Bill and Jonnes-Smith, Bill would end up in different bins and not be matched.

To prevent this issue, designers could use more appropriate attributes or apply phonetization techniques like SOUNDEX or METAPHONE for binning. These techniques help capture similar-sounding names, ensuring that potential matches are not overlooked due to minor variations in spelling.

Matching

During the matching phase, a SemQL condition compares all the pairs of records within a bin. This condition uses Record1 and Record2, which corresponds to the two records being matched. If the SemQL condition returns true, records are considered duplicates.

Sample matching condition (PostgreSQL)

The following matching condition matches pairs of customers who meet one of these criteria:

Their names sound the same in English (they are phonetized using PostgreSQL’s built-in double metaphone function), or
They have multiple elements of similarity:
- Their names are over 80% similar (using Semarchy edit distance function), and
- Their city name and address are more than 65% similar.

( DMETAPHONE(Record1.CustomerName) = DMETAPHONE(Record2.CustomerName)
OR
SEM_EDIT_DISTANCE_SIMILARITY(Record1.CustomerName,Record2.CustomerName) > 80 )
and SEM_EDIT_DISTANCE_SIMILARITY(Record1.InputAddress.Address, Record2.InputAddress.Address) > 65
and SEM_EDIT_DISTANCE_SIMILARITY(Record1.InputAddress.City, Record2.InputAddress.City) > 65

Overmatching and undermatching

The level of overmatching and undermatching that is acceptable depends on the requirements of each data management project and should be carefully considered, as it can have a significant impact in terms of performance and efficiency.

Undermatching happens when matching conditions are too strict, which can result in an excessive amount of duplicate records. While not causing technical problems, actual matches may be overlooked in the process, ultimately affecting data quality.
Overmatching happens when matching conditions are too lax, which results in excessively large clusters of matching records that are impossible to manage, either manually or automatically.

Overmatching is a common cause of performance bottlenecks during the matching phase, and can trigger overflow errors due to database limitations during the consolidation process. For this reason, defining match rules that would combine an excessive number of master records (in the range of 1,000) into a single golden record is highly discouraged.

Additionally, match groups containing several thousand records are considered excessively large clusters and are likely to cause out-of-memory errors. These errors may disrupt the matching or consolidation processes, leading to task failures. To maintain reliable operation and prevent system strain, it is crucial to design match rules that generate manageable match groups and align with available system resources, such as memory and database capacity.