As I described in a previous post, MDM revolves around metadata and should be based on comprehensive logical modeling. In that post, I also stated that the MDM hub should be generated from this metadata, and should implement the processes and structures needed to automatically enrich, validate, match, de-duplicate and certify golden data from raw source data.
Today, I will try to go into the details of the various aspects of this data certification process and describe Semarchy’s Convergence Hub™.
What is Semarchy’s Convergence Hub™?
It is a set of automated processes, together with their storage, that certify golden data in the master database from the source data published by applications.
Convergence Hub™ is at the core of the Semarchy Convergence for MDM™, as Semarchy generates it from the metadata defined in the platform.
Semarchy’s Convergence Hub™ in Action
Below is an example of Convergence Hub™ in action.
- Two source Publishers (in this case, a CRM and SAP) contain customers, and push Source customer data to Semarchy Convergence Hub™.
- Convergence Hub™ runs an Enrichment process on this data to normalize and complete it. In this case, the address data is geocoded, using for example the Google or Yahoo APIs, so that the address contains geocoded information (longitude, latitude and a precise postal address). After this phase, a first data Validation check takes place to exclude source data that is too poor for further processing. The validation is not shown on the schema, but you may notice that the email address is checked.
- Convergence Hub™ is now able to match the enriched and cleansed data, using a Matching process. After this phase, duplicates are automatically identified and grouped. Here, two similar records from the two publishers have been identified: they have the same geocoded address and almost the same name (Litrera vs. Utrera). Note that:
- Matching takes place within and between applications: If duplicates exist within SAP or the CRM, they will be detected. If duplicates exist between SAP and the CRM, they will be detected too.
- Matching relies on fuzzy detection algorithms, so near-identical values (such as Litrera and Utrera) can be recognized as the same customer.
- Using the groups of similar records (detected duplicates), Convergence Hub™ performs a Consolidation phase to create survivors with a Golden Record ID while keeping references to the Source records for lineage. The consolidation strategies are very granular. For example, you can consolidate Customer Name using the most frequent value, and always take the email value from the CRM in priority. After this phase, another Validation (not shown on the schema) takes place to check that the consolidated data can be considered valid, certified golden data.
- Finally, Convergence Hub™ merges the certified data into the Golden Records.
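The matching and consolidation steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Convergence Hub™ is implemented: the record layout, the field names, the 0.7 similarity threshold and the use of `difflib.SequenceMatcher` as the fuzzy comparison are all hypothetical choices made for the example.

```python
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical enriched source records; field names and values are illustrative.
records = [
    {"publisher": "CRM", "id": "C1", "name": "Utrera",
     "geocode": (40.4168, -3.7038), "email": "a.utrera@example.com"},
    {"publisher": "SAP", "id": "S1", "name": "Litrera",
     "geocode": (40.4168, -3.7038), "email": "old@example.com"},
    {"publisher": "SAP", "id": "S2", "name": "Utrera",
     "geocode": (40.4168, -3.7038), "email": None},
    {"publisher": "SAP", "id": "S3", "name": "Smith",
     "geocode": (51.5074, -0.1278), "email": "smith@example.com"},
]


def is_match(a, b, threshold=0.7):
    """Assumed rule: same geocoded address and a sufficiently similar name."""
    similarity = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return a["geocode"] == b["geocode"] and similarity >= threshold


def match(records):
    """Group detected duplicates, within and between publishers (greedy pass)."""
    groups = []
    for rec in records:
        for group in groups:
            if any(is_match(rec, other) for other in group):
                group.append(rec)
                break
        else:
            groups.append([rec])  # no similar record found: start a new group
    return groups


def consolidate(group, golden_id):
    """Create one survivor: most frequent name, email from the CRM in priority."""
    names = Counter(r["name"] for r in group)
    crm_emails = [r["email"] for r in group if r["publisher"] == "CRM" and r["email"]]
    other_emails = [r["email"] for r in group if r["publisher"] != "CRM" and r["email"]]
    emails = crm_emails + other_emails
    return {
        "golden_id": golden_id,
        "name": names.most_common(1)[0][0],
        "email": emails[0] if emails else None,
        # References to the source records are kept for lineage.
        "sources": [(r["publisher"], r["id"]) for r in group],
    }


golden = [consolidate(g, i) for i, g in enumerate(match(records), start=1)]
```

With this sample data, the three Utrera/Litrera records fall into one group and the Smith record into another. The greedy grouping is deliberately simple: it does not merge two existing groups when a later record resembles both.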
Let us look at how this is implemented as a succession of data buckets.
Implementing Convergence Hub™
The schema below shows a simplified view of the process implementation.
Convergence Hub™ is implemented in the following way:
- Source Data sent by the source publishers is passed through the Enrichment phase.
- It is passed through the Validation phase, which results in Source Errors and Master Integration records.
- The Master Integration records pass through the Matching and Consolidation phase to create Golden Integration records (unique records).
- These consolidated records are passed through another round of Validation, and invalid records end up in the Post-Consolidation Errors.
- Finally, the Golden Data and Master Data are merged from the Integration records.
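To make the flow between buckets concrete, here is a minimal Python sketch of the steps listed above. It is an assumption-laden illustration: the bucket names follow the list, but the stand-in rules (normalizing the email as "enrichment", exact email equality as "matching", a non-empty name as post-consolidation validation) and all function and field names are hypothetical.

```python
from collections import defaultdict


def enrich(rec):
    # Stand-in enrichment: normalize the email (a real hub might geocode addresses).
    out = dict(rec)
    out["email"] = (rec.get("email") or "").strip().lower()
    return out


def is_valid_source(rec):
    # Stand-in pre-consolidation validation: a very rough email check.
    return "@" in rec["email"]


def match_and_consolidate(master_integration):
    # Stand-in matching: records sharing a normalized email are duplicates.
    groups = defaultdict(list)
    for rec in master_integration:
        groups[rec["email"]].append(rec)
    golden_integration = []
    for golden_id, (email, members) in enumerate(groups.items(), start=1):
        for member in members:
            member["golden_id"] = golden_id  # link master record to its survivor
        golden_integration.append(
            {"golden_id": golden_id, "email": email, "name": members[0]["name"]}
        )
    return golden_integration


def is_valid_golden(golden):
    # Stand-in post-consolidation validation: the name must not be empty.
    return bool(golden["name"])


def run_hub(source_data):
    enriched = [enrich(r) for r in source_data]
    source_errors = [r for r in enriched if not is_valid_source(r)]
    master_integration = [r for r in enriched if is_valid_source(r)]
    golden_integration = match_and_consolidate(master_integration)
    post_errors = [g for g in golden_integration if not is_valid_golden(g)]
    golden_data = [g for g in golden_integration if is_valid_golden(g)]
    certified = {g["golden_id"] for g in golden_data}
    master_data = [m for m in master_integration if m["golden_id"] in certified]
    return {
        "source_errors": source_errors,
        "master_integration": master_integration,
        "golden_integration": golden_integration,
        "post_consolidation_errors": post_errors,
        "golden_data": golden_data,
        "master_data": master_data,
    }


buckets = run_hub([
    {"publisher": "CRM", "id": "C1", "name": "Utrera", "email": " A.Utrera@Example.com "},
    {"publisher": "SAP", "id": "S1", "name": "Litrera", "email": "a.utrera@example.com"},
    {"publisher": "SAP", "id": "S2", "name": "Smith", "email": "not-an-email"},
    {"publisher": "CRM", "id": "C2", "name": "", "email": "x@example.com"},
])
```

In this run, record S2 lands in the source errors, C2 passes the first validation but is rejected after consolidation, and C1 and S1 are certified as a single golden record.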
A few questions you may ask…
When looking at this pattern, you may have the following questions:
Why do you keep the Master/Golden Integration Data?
These are needed for lineage purposes. Simply put, lineage answers the following question: “Where does this (data) come from?”
There are two main ways of navigating the golden data lineage via Convergence Hub™:
- If you are browsing a Golden Record and need to see the various records it originates from, you would navigate Golden Data > Master Data > Source Data.
- If you want to see all records, including those that were rejected, you cannot use the same path, because the Golden Data and Master Data contain only the records that made it to the “Golden” stage. You would instead go through Golden Integration > Master Integration > Source Data. You could even start from Post-Consolidation Errors and understand why a record was rejected late in the process: Is it because the consolidation is not right? Is it because the enrichment is not good enough?
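The two navigation paths can be illustrated with a small Python sketch. The bucket contents, keys and the `lineage` helper are hypothetical and exist only to show why rejected records are reachable through the Integration buckets but not through the certified ones.

```python
# Hypothetical buckets; keys and field names are illustrative only.
source_data = {
    ("CRM", "C1"): {"name": "Utrera"},
    ("SAP", "S1"): {"name": "Litrera"},
    ("SAP", "S9"): {"name": ""},
}

# Integration buckets hold every record that passed the first validation.
master_integration = [
    {"publisher": "CRM", "id": "C1", "golden_id": 1},
    {"publisher": "SAP", "id": "S1", "golden_id": 1},
    {"publisher": "SAP", "id": "S9", "golden_id": 2},
]
golden_integration = {1: {"name": "Utrera"}, 2: {"name": ""}}
post_consolidation_errors = {2: "name is empty"}

# Certified buckets hold only what made it to the golden stage.
golden_data = {
    gid: g for gid, g in golden_integration.items()
    if gid not in post_consolidation_errors
}
master_data = [m for m in master_integration if m["golden_id"] in golden_data]


def lineage(golden_id, masters, sources):
    """Walk golden -> master -> source and return the originating source keys."""
    return [
        (m["publisher"], m["id"])
        for m in masters
        if m["golden_id"] == golden_id and (m["publisher"], m["id"]) in sources
    ]


# Path 1: Golden Data > Master Data > Source Data (certified records only).
certified_path = lineage(1, master_data, source_data)
rejected_via_certified = lineage(2, master_data, source_data)

# Path 2: Golden Integration > Master Integration > Source Data (all records).
rejected_via_integration = lineage(2, master_integration, source_data)
```

Starting from the post-consolidation error on golden ID 2, only the second path leads back to the source record that caused the rejection.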
Is data historized?
Of course! As source systems push new batches (or deltas) into the Source Data, you need to keep a history of these batches and be able to track the origin of golden data changes over time.
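One simple way to picture this is an append-only store in which every record version carries the identifier of the batch that delivered it. The sketch below is an assumption for illustration only; the post does not describe the actual history tables, and the `batch_id` field and both helper functions are hypothetical.

```python
def load_batch(history, batch_id, records):
    """Append a batch (or delta) without overwriting earlier versions."""
    for rec in records:
        history.append({"batch_id": batch_id, **rec})


def as_of(history, publisher, rec_id, batch_id):
    """Return the latest version of a source record at or before a given batch."""
    versions = [
        h for h in history
        if h["publisher"] == publisher and h["id"] == rec_id
        and h["batch_id"] <= batch_id
    ]
    return max(versions, key=lambda h: h["batch_id"]) if versions else None


history = []
load_batch(history, 1, [{"publisher": "CRM", "id": "C1", "name": "Litrera"}])
load_batch(history, 2, [{"publisher": "CRM", "id": "C1", "name": "Utrera"}])

before = as_of(history, "CRM", "C1", 1)   # version delivered by batch 1
after = as_of(history, "CRM", "C1", 2)    # version delivered by batch 2
unknown = as_of(history, "SAP", "S1", 2)  # never published
```

Because nothing is overwritten, a change in a golden value can be traced back to the batch that introduced it.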
How is Convergence Hub™ really implemented? Is it Java? What is the storage?
Storage and processing are done principally within a database. This is what delivers the best performance, even with the largest volumes (really think “terabytes and above”).
Isn’t it a little bit complex?
Yes, but it is extremely simple to implement this using the Semarchy Convergence for MDM™: Semarchy will automatically generate this entire pattern (storage AND processes) for you in a database, using a logical model design.
So you would not really have to worry about this pattern’s complexity, unless you try to implement it by yourself.
For more information about how to get this done, see the Semarchy Demo Center.