Matching records
The matching process involves comparing pairs of records using multiple match rules to determine if they are duplicates.
The actual matching phase may be preceded by a binning phase.
Binning
Binning groups records into smaller, more manageable sets, or bins. This approach reduces the number of comparisons needed by narrowing down the search scope.
It is achieved through multiple SemQL expressions specified in a match rule. Records for which all binning expressions produce identical results are placed in the same bin.
While binning is a valuable tool that makes the matching logic conceptually simpler, it does not necessarily enhance performance. Its effectiveness relies on using selective binning attributes and having custom indexes on the related columns in the database.
|
Effective use of binning
Instead of matching all records globally, binning allows for:
-
Matching records based on a specific attribute (e.g., matching customers only when they are located in the same country).
-
Further narrowing the binning results by combining attributes or specific functions (e.g., matching customers within the same country and sales region).
To bin customers according to their country and sales region in the GeocodedAddress
complex field, you could use the following expressions:
-
Binning expression #1:
GeocodedAddress.Country
-
Binning expression #2:
GeocodedAddress.Region
Smaller bins lead to faster processing, but designers should make sure that binning does not exclude potential matches. Example. Incorrect binning
Using the first four letters of a last name for binning can separate similar records into different bins, leading to missed matches. In such a case, To prevent this issue, designers could use more appropriate attributes or apply phonetization techniques like SOUNDEX or METAPHONE for binning. These techniques help capture similar-sounding names, ensuring that potential matches are not overlooked due to minor variations in spelling. |
Matching
During the matching phase, a SemQL condition compares all the pairs of records within a bin. This condition uses two pseudo records, named Record1 and Record2, corresponding to the two records being matched. If the SemQL condition returns true, records are considered duplicates.
The following matching condition matches pairs of customers who meet one of these criteria:
-
Their names sound the same in English (they are phonetized using PostgreSQL’s built-in double metaphone function), or
-
They have multiple elements of similarity:
-
Their names are over 80% similar (using Semarchy edit distance function), and
-
Their city name and address are more than 65% similar.
-
( DMETAPHONE(Record1.CustomerName) = DMETAPHONE(Record2.CustomerName)
OR
SEM_EDIT_DISTANCE_SIMILARITY(Record1.CustomerName,Record2.CustomerName) > 80 )
and SEM_EDIT_DISTANCE_SIMILARITY(Record1.InputAddress.Address, Record2.InputAddress.Address) > 65
and SEM_EDIT_DISTANCE_SIMILARITY(Record1.InputAddress.City, Record2.InputAddress.City) > 65
Overmatching and undermatching
The level of overmatching and undermatching that is acceptable depends on the requirements of each data management project and should be carefully considered, as it can have a significant impact in terms of performance and efficiency.
|