Data Quality

Overview

Data Quality Rules

Data Quality Rules include all the rules in the model that enforce a certain level of quality on the entities. These rules include:

  • Mandatory Attributes: An attribute must not have a null value. For example, the Phone attribute in the Customer entity must not be null.

  • References (Mandatory References): An entity with a non-mandatory reference must have a valid referenced entity or a null reference (no referenced entity). For mandatory references, the entity must have a valid reference and does not allow null references.

  • LOV Validation: An attribute with an LOV type must have all its values defined in the LOV. For example, the Gender attribute of the Customer entity is a LOV of type GenderLOV. It must have its values in the following range: [M:Male, F:Female].

  • Unique Keys: A group of column that has a unique value. For example, for a Product entity, the pair ProductFamilyName, ProductName must be unique.

  • Validations: A formula that must be valid on a given record. For example, a Customer entity must have either a valid Email or a valid Address.

The Mandatory Attributes and LOV Validations are designed when creating entities. The references are defined when creating reference relationships.

The data quality rules may be enforced:

  • Automatically in the certification process for all data that is published into the data hub. In this process validations check and reject invalid records. The validations are executed on each record, according to their execution scope.

  • Interactively in applications when users author data.

Validation Execution Scope

Validations can take place pre-consolidation and/or post-consolidation:

  • Pre-Consolidation: Applies to source data, after the enrichment phase and before the matching phase. All the source records pass through all the pre-consolidation checks and records failing one check are isolated from the certification flow. All the errors for each record are raised.

  • Post-Consolidation: Applies to the golden data. All the consolidated records pass through all the post-consolidation checks and records failing one check are flagged as erroneous. They are not removed from the flow. All the errors for each record are raised. Note that post-consolidation validations are not supported for Basic Entities.

Pre-Consolidation validations remove erroneous source records from the certification process, and their data does not contribute to the golden records. Post-Consolidation validations only flag erroneous golden records. They still appear as Golden Records with Errors.
Unique Keys are only checked post-consolidation, as they only make sense on consolidated records.
When a child record references a parent record detected as erroneous during the pre-consolidation validation, then this child record is also rejected in the pre-consolidation validation by the reference validation. If a previous valid version of the parent record exists in the hub, then the child record is attached to that version instead. During post-consolidation validation, a reference is always considered valid if the referenced record exists (flagged as erroneous or not).

Choose the Validation Scope

Choosing the scope of a validation has impact on the certification process.
The following examples will illustrate the impact of the choice of the pre or post consolidation validation.

Post-Consolidation Validation

A CheckNullRevenue validation checks that Revenue is not null for a Customer entity.
With this entity:

  • Customer data is published from the CRM and Sales applications.

  • Only the Sales publisher loads revenue data. CRM leaves this attribute null.

  • The consolidation needs critical information from the CRM application (email, name, address, etc…)

Validation scope impact on data certification:

  • If CheckNullRevenue is executed pre-consolidation, all data from the CRM is rejected, as revenue is null. No data from the CRM is matched or merged and the golden records are incomplete.

  • If CheckNullRevenue is executed post-Consolidation, records from both CRM and Sales are matched and merged, and the resulting records are flagged as erroneous if they still have a null revenue.

Pre-Consolidation Validation

The matching process for the Customer entity uses the GeocodedAddress to match customers from all the sources. This value should be very accurate.

An IsValidGeocodedAddress validation checks that GeocodedAddress is not empty and GeocodedAddress.Quality is high enough after the enrichment phase. If the GeocodedAddress is empty or not good enough, then these customers should not be processed further as they are not fit for the matching phase.

In this example, IsValidGeocodedAddress should be executed Pre-consolidation to reject the records with addresses not good enough for accurate matching, before the matching phase.