Matching records

The matching process involves comparing pairs of records using multiple match rules to determine if they are duplicates.

The actual matching phase may be preceded by a binning phase.

Binning

Binning groups records into smaller, more manageable sets, or bins. This approach reduces the number of comparisons needed by narrowing down the search scope.

It is achieved through multiple SemQL expressions specified in a match rule. Records for which all binning expressions produce identical results are placed in the same bin.

While binning is a valuable tool that makes the matching logic conceptually simpler, it does not necessarily enhance performance. Its effectiveness relies on using selective binning attributes and having custom indexes on the related columns in the database.
As a matter of fact, direct attribute comparisons in the match rule often provide better performance. For instance:

Record2.Attr1 IS NOT NULL AND Record2.Attr1 = Record1.Attr1

Effective use of binning

Instead of matching all records globally, binning allows for:

  • Matching records based on a specific attribute (e.g., matching customers only when they are located in the same country).

  • Further narrowing the binning results by combining attributes or specific functions (e.g., matching customers within the same country and sales region).

Example. Binning by country and sales region

To bin customers according to their country and sales region in the GeocodedAddress complex field, you could use the following expressions:

  • Binning expression #1: GeocodedAddress.Country

  • Binning expression #2: GeocodedAddress.Region

Smaller bins lead to faster processing, but designers should make sure that binning does not exclude potential matches.

Example. Incorrect binning

Using the first four letters of a last name for binning can separate similar records into different bins, leading to missed matches. In such a case, Jones-Smith, Bill and Jonnes-Smith, Bill would end up in different bins and not be matched.

To prevent this issue, designers could use more appropriate attributes or apply phonetization techniques like SOUNDEX or METAPHONE for binning. These techniques help capture similar-sounding names, ensuring that potential matches are not overlooked due to minor variations in spelling.

Matching

During the matching phase, a SemQL condition compares all the pairs of records within a bin. This condition uses two pseudo records, named Record1 and Record2, corresponding to the two records being matched. If the SemQL condition returns true, records are considered duplicates.

Sample matching condition (PostgreSQL)

The following matching condition matches pairs of customers who meet one of these criteria:

  • Their names sound the same in English (they are phonetized using PostgreSQL’s built-in double metaphone function), or

  • They have multiple elements of similarity:

    • Their names are over 80% similar (using Semarchy edit distance function), and

    • Their city name and address are more than 65% similar.

( DMETAPHONE(Record1.CustomerName) = DMETAPHONE(Record2.CustomerName)
OR
SEM_EDIT_DISTANCE_SIMILARITY(Record1.CustomerName,Record2.CustomerName) > 80 )
and SEM_EDIT_DISTANCE_SIMILARITY(Record1.InputAddress.Address, Record2.InputAddress.Address) > 65
and SEM_EDIT_DISTANCE_SIMILARITY(Record1.InputAddress.City, Record2.InputAddress.City) > 65
Overmatching and undermatching

The level of overmatching and undermatching that is acceptable depends on the requirements of each data management project and should be carefully considered, as it can have a significant impact in terms of performance and efficiency.

  • Undermatching happens when matching conditions are too strict, which can result in an excessive amount of duplicate records. While not causing technical problems, actual matches may be overlooked in the process, ultimately affecting data quality.

  • Overmatching happens when matching conditions are too lax, which results in excessively large clusters of matching records that are impossible to manage, either manually or automatically.

Overmatching is a common cause of performance bottlenecks during the matching phase, and can trigger overflow errors due to database limitations during the consolidation process. For this reason, designers should avoid defining match rules that would combine an excessive number of master records (in the range of 1,000) into a single golden record.