To perform the matching process, pair of records are compared using multiple match rules, to decide whether they are duplicates.
Each Match Rule runs as a two phase logical process:
Binning: This phase uses a set of SemQL expressions to group the records into bins. Binning is a Divide and Conquer approach to avoid an excessive number of comparisons.
Matching: This phase uses a SemQL condition that compares all the pairs of records (Record1 is compared with Record2) within a bin. When this condition returns true, the pair of records are considered as duplicates.
The binning phase divides the source records into bins to allow the matching phase to run only within a given bin. As the matching phase can be a resource consuming one, reducing the number of comparisons is important for high performance matching.
Instead of trying to match all customers, we will try to match customers only when they are located in the same country. Binning will take the Country attribute as the binning expression.
If we want to make the matching phase even more efficient, we could match customers with the same Country and SalesRegion.
Binning is completed using several SemQL expressions defined in the match rule. The records for which all binning expressions give the same results belong to the same bin.
To perform binning for Customers with the same Country and Region’s first letter in the GeocodedAddress complex field, we would use:
Binning Expression #1:
Binning Expression #2:
|Smaller Bins will mean faster processing, but, you must make sure that binning does not exclude possible matches.|
Binning persons by the first four letters of their last name is not a good choice:
Jones-Smith, Bill and
Jonnes-Smith, Bill would end up into different bins, and would never be matched, although they are obviously similar.
For this specific case, you may consider a different attribute, or use phonetization (SOUNDEX, METAPHONE) on the name for binning.
The matching phase uses a condition that compares two records.
This condition uses two pseudo records named Record1 and Record2 corresponding to the two records being matched. If this condition is true, then the two records are considered as matched.
The following matching condition matches pairs of customers meeting one of following requirements:
Their names sound the same in English (they are phonetized using PostgreSQL’s built-in double metaphone function) OR
They have multiple elements of similarity:
Their names are similar (using Semarchy edit distance function) by more than 80%, AND
Their city names and addresses are similar by more 65%.
( DMETAPHONE(Record1.CustomerName) = DMETAPHONE(Record2.CustomerName) OR SEM_EDIT_DISTANCE_SIMILARITY(Record1.CustomerName,Record2.CustomerName)>80 ) and SEM_EDIT_DISTANCE_SIMILARITY(Record1.InputAddress.Address, Record2.InputAddress.Address) > 65 and SEM_EDIT_DISTANCE_SIMILARITY( Record1.InputAddress.City, Record2.InputAddress.City ) > 65