Data quality
Data quality involves maintaining accurate and reliable data through rules, validation processes, and execution scopes.
Data quality rules
Data quality rules include all rules in a model that impose a quality standard on its entities. These rules include:
- Mandatory attributes: these attributes must not have a null value. For example, the Phone attribute in the Customer entity must not be null.
- References: a record with a non-mandatory reference must either reference a valid record or have a null reference (i.e., no referenced record). A record with a mandatory reference must reference a valid record; a null reference is not allowed.
- LOV validations: an attribute with an LOV type must take its value from the list of values (LOV) defined for that type. For example, the Gender attribute of the Customer entity uses the GenderLOV type, which restricts its values to the range [M: male, F: female].
- Unique keys: a unique key defines a group of columns whose combined values must be unique across records. For example, for a Product entity, the ProductFamilyName, ProductName pair must be unique.
- Validations: formulas that must hold true for a given record. For example, a Customer record must have either a valid Email or a valid Address.
Mandatory attributes and LOV validations are set when creating entities, while references are defined when creating reference relationships.
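As an illustration, the following Python sketch shows how these rule types might be expressed as record checks. The Customer structure, the GENDER_LOV set, and the check functions are hypothetical, modeled on the examples above; they are not the product's actual API.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical Customer record mirroring the examples above.
@dataclass
class Customer:
    phone: Optional[str]
    gender: Optional[str]
    email: Optional[str]
    address: Optional[str]

GENDER_LOV = {"M", "F"}  # [M: male, F: female]

def check_mandatory_phone(c: Customer) -> bool:
    # Mandatory attribute: Phone must not be null.
    return c.phone is not None

def check_gender_lov(c: Customer) -> bool:
    # LOV validation: Gender must be a value defined in GenderLOV.
    return c.gender in GENDER_LOV

def check_email_or_address(c: Customer) -> bool:
    # Validation formula: a valid Email or a valid Address is required.
    return c.email is not None or c.address is not None

def check_unique_product_key(products: list[tuple[str, str]]) -> bool:
    # Unique key: each (ProductFamilyName, ProductName) pair must be
    # unique, so this check needs the full record set, not one record.
    return len(products) == len(set(products))
```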
Data quality rules may be enforced:
- Automatically, in the certification process, for all data published into the data hub. Validations check and reject invalid records, and are executed on each record according to their execution scope.
- Interactively, in applications, when users author data.
Validation execution scope
Validations can occur pre-consolidation and/or post-consolidation.
- Pre-consolidation: applies to source data after the enrichment phase and before the matching phase. Every source record undergoes all pre-consolidation checks, and records failing any check are isolated from the certification flow. All errors are raised for each record.
- Post-consolidation: applies to golden data. Every consolidated record goes through all post-consolidation checks, and records failing any check are flagged as erroneous but not removed from the flow. All errors are raised for each record.
Post-consolidation validations are not applicable to basic entities.
Pre-consolidation validations remove erroneous source records from the certification process, so their data does not contribute to golden records. Post-consolidation validations only flag erroneous golden records; these still appear as golden records, with errors.
Unique keys are only checked post-consolidation, as they make sense only for consolidated records.
When a child record references a parent record detected as erroneous during pre-consolidation validation, the child record is also rejected at that stage by the reference validation. If a previous valid version of the parent record exists in the hub, the child record is attached to that version instead. During post-consolidation validation, a reference is always considered valid if the referenced record exists, whether or not that record is flagged as erroneous.
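The following sketch contrasts the two scopes, assuming a hypothetical certify function, dict-based records, and check functions passed in as callables. The key point it illustrates is that pre-consolidation failures isolate a record from the flow, while post-consolidation failures only mark it.

```python
def certify(source_records, pre_checks, post_checks, consolidate):
    # Pre-consolidation: all errors are raised for each record, and
    # records failing any check are isolated from the certification flow.
    valid, rejected = [], []
    for rec in source_records:
        errs = [chk.__name__ for chk in pre_checks if not chk(rec)]
        if errs:
            rejected.append((rec, errs))  # does not reach matching
        else:
            valid.append(rec)

    # Only valid source records reach matching/consolidation, so
    # rejected data never contributes to golden records.
    golden = consolidate(valid)

    # Post-consolidation: failing golden records are flagged as
    # erroneous but stay in the flow.
    for rec in golden:
        rec["errors"] = [chk.__name__ for chk in post_checks if not chk(rec)]
    return golden, rejected
```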
Choosing the validation scope
Choosing the scope of a validation impacts the certification process.
The examples below illustrate the impact of pre- or post-consolidation validation.
The matching process for the Customer entity uses GeocodedAddress to match customers from all sources. This value has to be very accurate.
An IsValidGeocodedAddress validation checks, after the enrichment phase, that GeocodedAddress is not empty and that GeocodedAddress.Quality is high enough. If the GeocodedAddress is empty or unsatisfactory, these customer records should not be processed further, as they are not fit for the matching phase.
In this example, IsValidGeocodedAddress should be executed pre-consolidation to reject any record whose address does not allow for accurate matching.
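A minimal sketch of such a pre-consolidation check, assuming a dict-based record, a nested GeocodedAddress value carrying a Quality score, and a hypothetical quality threshold:

```python
MIN_QUALITY = 0.9  # hypothetical threshold for accurate matching

def is_valid_geocoded_address(record: dict) -> bool:
    # Reject records whose address is missing or whose geocoding
    # quality is too low for the matching phase.
    addr = record.get("GeocodedAddress")
    return addr is not None and addr.get("Quality", 0.0) >= MIN_QUALITY
```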
A CheckNullRevenue validation ensures that revenue is not null for the Customer entity.
With this entity:
- Customer data is published from the CRM and Sales applications.
- Only the Sales publisher loads revenue data; CRM leaves this attribute null.
- Critical information (email, name, address, etc.) is required from the CRM application during the certification process.
This is the potential impact of the validation scope on the data certification process:
- If CheckNullRevenue is executed pre-consolidation, all data from the CRM is rejected, as revenue is null. No data from the CRM is matched or merged, and the golden records are incomplete.
- If CheckNullRevenue is executed post-consolidation, records from both CRM and Sales are matched and merged, and the resulting golden records are flagged as erroneous if they still have a null revenue.
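To make the contrast concrete, here is a hypothetical run of both scopes over two source records for the same customer, using a simplistic first-non-null merge strategy that stands in for the real matching and consolidation steps.

```python
def check_null_revenue(record: dict) -> bool:
    # Revenue must not be null.
    return record.get("Revenue") is not None

def merge(records, keys):
    # Hypothetical consolidation: first non-null value wins per attribute.
    return {k: next((r[k] for r in records if r.get(k) is not None), None)
            for k in keys}

crm   = {"Email": "jane@example.com", "Revenue": None}   # CRM: no revenue
sales = {"Email": None,               "Revenue": 1200}   # Sales: revenue only

# Pre-consolidation: the CRM record is rejected before matching, so the
# golden record is built from Sales data alone and lacks the email.
survivors = [r for r in (crm, sales) if check_null_revenue(r)]
print(merge(survivors, ("Email", "Revenue")))
# {'Email': None, 'Revenue': 1200}  -> incomplete golden record

# Post-consolidation: both records are merged first; the golden record
# is complete and is only flagged if revenue is still null.
golden = merge((crm, sales), ("Email", "Revenue"))
golden["errors"] = [] if check_null_revenue(golden) else ["CheckNullRevenue"]
print(golden)
# {'Email': 'jane@example.com', 'Revenue': 1200, 'errors': []}
```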