To perform the matching process, pairs of records are compared using multiple match rules, to decide whether they are duplicates.
Each match rule runs as a two-phase logical process:
Binning: this phase uses a set of SemQL expressions to group the records into bins. Binning is a "divide and conquer" approach to avoid an excessive number of comparisons.
Matching: this phase uses a SemQL condition that compares all the pairs of records (Record1 is compared with Record2) within a bin. When this condition returns true, the pair of records are considered as duplicates.
The binning phase divides the source records into bins to allow the matching phase to run only within a given bin. As the matching phase can be a resource-consuming one, reducing the number of comparisons is important for high-performance matching.
Instead of trying to match all customers, we will try to match customers only when they are located in the same country. Binning will take the Country attribute as the binning expression.
If we want to make the matching phase even more efficient, we could match customers with the same Country and SalesRegion.
Binning is completed using several SemQL expressions defined in the match rule. The records for which all binning expressions give the same results belong to the same bin.
To perform binning on customers whose country and region start with the same letter in the GeocodedAddress complex field, we would use:
Binning expression #1:
Binning expression #2:
|Smaller bins mean faster processing, but you must make sure that binning does not exclude possible matches.
Binning persons by the first four letters of their last name is not a good choice:
Jones-Smith, Bill and
Jonnes-Smith, Bill would end up in different bins, and would never be matched, although they are obviously similar.
For this specific case, you may consider a different attribute, or use phonetization (SOUNDEX, METAPHONE) on the name for binning.
The matching phase uses a condition that compares two records.
This condition uses two pseudo records, named Record1 and Record2, corresponding to the two records being matched. If this condition is true, then the two records are considered as matched.
The following matching condition matches pairs of customers meeting one of the following requirements:
Their names sound the same in English (they are phonetized using PostgreSQL built-in double metaphone function), or
They have multiple elements of similarity:
Their names are similar (using Semarchy edit distance function) by more than 80%, and
Their city name and address are similar by more than 65%.
( DMETAPHONE(Record1.CustomerName) = DMETAPHONE(Record2.CustomerName)
and SEM_EDIT_DISTANCE_SIMILARITY(Record1.InputAddress.Address, Record2.InputAddress.Address) > 65
and SEM_EDIT_DISTANCE_SIMILARITY( Record1.InputAddress.City, Record2.InputAddress.City ) > 65
Overmatching and undermatching
The level of overmatching and undermatching that is acceptable depends on the requirements of each data management project and should be carefully considered, as it can have a significant impact in terms of performance and efficiency.