Matching records

To perform the matching process, pairs of records are compared using multiple match rules, to decide whether they are duplicates.

Matching phases

Each match rule runs as a two-phase logical process:

  • Binning: this phase uses a set of SemQL expressions to group the records into bins. Binning is a "divide and conquer" approach to avoid an excessive number of comparisons.

  • Matching: this phase uses a SemQL condition that compares all the pairs of records (Record1 is compared with Record2) within a bin. When this condition returns true, the pair of records are considered as duplicates.

Binning

The binning phase divides the source records into bins to allow the matching phase to run only within a given bin. As the matching phase can be a resource-consuming one, reducing the number of comparisons is important for high-performance matching.

For example:

  • Instead of trying to match all customers, we will try to match customers only when they are located in the same country. Binning will take the Country attribute as the binning expression.

  • If we want to make the matching phase even more efficient, we could match customers with the same Country and SalesRegion.

Binning is completed using several SemQL expressions defined in the match rule. The records for which all binning expressions give the same results belong to the same bin.

Binning by Country and Region

To perform binning for Customers with the same Country and Region’s first letter in the GeocodedAddress complex field, we would use:

  • Binning Expression #1: GeocodedAddress.Country

  • Binning Expression #2: SUBSTR(GeocodedAddress.Region,1,1)

Smaller Bins will mean faster processing, but, you must make sure that binning does not exclude possible matches.
Incorrect binning

Binning persons by the first four letters of their last name is not a good choice: Jones-Smith, Bill and Jonnes-Smith, Bill would end up in different bins, and would never be matched, although they are obviously similar.

For this specific case, you may consider a different attribute, or use phonetization (SOUNDEX, METAPHONE) on the name for binning.

Matching

The matching phase uses a condition that compares two records.
This condition uses two pseudo records named Record1 and Record2 corresponding to the two records being matched. If this condition is true, then the two records are considered as matched.

Sample matching condition (PostgreSQL)

The following matching condition matches pairs of customers meeting one of the following requirements:

  • Their names sound the same in English (they are phonetized using PostgreSQL’s built-in double metaphone function) OR

  • They have multiple elements of similarity:

    • Their names are similar (using Semarchy edit distance function) by more than 80%, and

    • Their city name and address are similar by more than 65%.

( DMETAPHONE(Record1.CustomerName) = DMETAPHONE(Record2.CustomerName)
OR
SEM_EDIT_DISTANCE_SIMILARITY(Record1.CustomerName,Record2.CustomerName)>80 )
and SEM_EDIT_DISTANCE_SIMILARITY(Record1.InputAddress.Address, Record2.InputAddress.Address) > 65
and SEM_EDIT_DISTANCE_SIMILARITY( Record1.InputAddress.City, Record2.InputAddress.City ) > 65
Overmatching and undermatching

The level of overmatching and undermatching that is acceptable depends on the requirements of each data management project and should be carefully considered, as it can have a significant impact in terms of performance and efficiency.

  • Undermatching happens when matching conditions are too strict, which can result in an excessive amount of duplicate records. While not causing technical problems, actual matches may be overlooked in the process, ultimately affecting data quality.

  • Overmatching happens when matching conditions are too lax, which results in excessively large clusters of matching records that are impossible to manage, either manually or automatically.

Overmatching is a common cause of performance bottlenecks during the matching phase, and can trigger overflow errors due to database limitations during the consolidation process. For this reason, designers should avoid defining match rules that would combine an excessive number of master records (in the range of 1,000) into a single golden record.