Configure data retention

This document describes how to define the retention policies for historical and lineage data.

Introduction to data retention

The data hub stores the lineage and history of the certified golden data, that is, the data that led to the current state of the golden data:

  • The built-in Lineage traces the whole certification process from the golden data back to the sources. It traces source data changes pushed to the hub through external loads, or performed in the hub, for example using steppers.

  • Data Historization traces the changes made to the golden and master data.

Preserving the lineage and history is a master data governance requirement and is key to regulatory compliance. However, keeping this information may also create a large volume of data in the hub storage.

To keep a reasonable volume of information, data location managers schedule purges for this data. To make sure lineage and history are preserved according to the data governance and compliance requirements, model designers define Data Retention Policies for the model.

Data retention policies

Data Retention Policies are defined with:

  • A Model Retention Policy, defining the retention duration for history and lineage data in the hub. This policy applies by default to all entities with no specific policy.

  • Entity Retention Policies that override the model retention policy for specific entities.

For example:

  • The hub is configured to retain no history at all. This is the general policy.

  • Employee data is retained for 10 years.

  • Product data is retained forever.
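
The override rule in this example can be sketched in a few lines of Python. This is only an illustration of how an entity policy takes precedence over the model policy. The class, the dictionary, and the Customer entity are hypothetical names introduced here and are not part of the product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RetentionPolicy:
    """Hypothetical model of a retention policy (not a product API)."""
    kind: str                       # "FOREVER", "NO_RETENTION" or "PERIOD"
    duration: Optional[int] = None  # only meaningful when kind == "PERIOD"
    unit: Optional[str] = None      # for example "YEAR"


# Model retention policy: the general policy, retain no history at all.
MODEL_POLICY = RetentionPolicy("NO_RETENTION")

# Entity retention policies: at most one per entity, overriding the model policy.
ENTITY_POLICIES = {
    "Employee": RetentionPolicy("PERIOD", 10, "YEAR"),
    "Product": RetentionPolicy("FOREVER"),
}


def effective_policy(entity: str) -> RetentionPolicy:
    """Return the entity policy if one exists, else the model policy."""
    return ENTITY_POLICIES.get(entity, MODEL_POLICY)
```

With these definitions, `effective_policy("Employee")` returns the 10-year policy, while an entity with no specific policy, such as the hypothetical Customer, falls back to the model policy.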

Data purge

Data purges take place in the deployed hub, according to the retention policies defined for the model.

The purge deletes the following elements of the lineage and history:

  • Source data published to the hub via external loads

  • Data authored (created, modified or overridden) in the hub

  • Traces of deleted data

  • Golden and master data history (if historization is configured)

  • Errors detected on the source and authored data by the integration job

  • Duplicate choices made by users in duplicate managers.
    Note that the duplicate management decisions still apply after a purge, but information about the time of the decision and the decision maker is deleted.

The purges only impact the history and lineage of the data in the data location. They do not delete actual golden and master data.

Optionally, the following repository artifacts can also be deleted as part of the purge process:

  • Job logs, batches and loads for which all the processed data has been purged.

  • Direct authoring, duplicate manager and workflow instances for which all the changed data has been purged.

Job logs, batches, loads, direct authoring, duplicate manager and workflow instances are purged only when all their data has been purged. They are therefore purged according to the longest retention policy among the entities that they manage.
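
The "longest retention policy" rule can be illustrated with a short Python sketch. It assumes, for illustration only, that each retention is expressed as a number of days, with 0 for No Retention and `None` for Forever. The function name and this representation are hypothetical and do not describe the product's internals.

```python
from typing import Iterable, Optional

FOREVER = None  # hypothetical marker for a "Forever" retention


def longest_retention(retention_days: Iterable[Optional[int]]) -> Optional[int]:
    """Return the longest retention among the entities an artifact manages.

    An artifact (job log, batch, load, and so on) can only be purged once
    the data of every entity it touched has been purged, so the longest
    retention wins. A single Forever entity keeps the artifact forever.
    """
    longest = 0
    for days in retention_days:
        if days is FOREVER:
            return FOREVER
        longest = max(longest, days)
    return longest


# A batch that loaded Employee data (about 10 years, approximated as
# 3650 days) and data for an entity with no retention (0 days):
batch_retention = longest_retention([3650, 0])
```

In this sketch, the batch is kept as long as the Employee data. If the batch had also loaded Product data, which is retained forever, it would never be purged.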

Deploy purge jobs

When a model is deployed, a purge job is automatically created in the deployment data location. This job purges data and artifacts according to the retention policy defined in the deployed model.

Purge jobs are scheduled by the data location manager as part of the data location configuration. See Configure data purge for more information.

Regardless of the frequency of the purge schedule, the data history is retained as defined by the model designer in the data retention policies.
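
One way to picture this is that the purge schedule only decides when the check runs, while the retention policy decides what is eligible. The following Python sketch rests on that assumption. How the purge job actually computes eligibility is not described in this document, and the function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def is_purgeable(changed_at: datetime,
                 run_time: datetime,
                 retention: Optional[timedelta]) -> bool:
    """Tell whether a history record is eligible for purge at run_time.

    retention is None for Forever, timedelta(0) for No Retention, or the
    configured period. The schedule frequency does not appear here: it
    only determines how often this check is evaluated.
    """
    if retention is None:  # Forever: never purged
        return False
    return changed_at < run_time - retention


# Ten years approximated as 3650 days, for illustration only.
ten_years = timedelta(days=3650)
run_time = datetime(2024, 1, 1)

old_change = is_purgeable(datetime(2010, 1, 1), run_time, ten_years)
recent_change = is_purgeable(datetime(2020, 1, 1), run_time, ten_years)
```

Whether the purge runs daily or monthly, a record newer than the retention period is never eligible. A less frequent schedule only delays the removal of records that are already past their retention period.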

Define data retention policies

The model retention policy applies to all the entities with no specific retention policy.

To define the model data retention policy:

  1. Connect to an open model edition.

  2. In the Model Design view, double-click the Retention Policies node. The Data Retention Policy editor opens.

  3. In the Data Retention Policy editor, set the Default Retention Policy for the model:

    • Retention Period: Select the default retention type: Forever, No Retention or Period.

    • If you have selected Period, set a Time Unit and Time Duration for the retention period.

    • In the Description field, optionally enter a description for the retention policy.

  4. Press Control+S (or Command+S on macOS) to save the editor.

In addition to the model retention policy, you can also define entity-specific retention policies in this editor.

To define an entity retention policy:

  1. In the Data Retention Policy editor, click Add Entity Retention Policy. The Create New Entity Retention Policy wizard opens.

  2. In the Create New Entity Retention Policy wizard:

    • In the Entity field, select the entity for which you want to define a specific policy.

    • Retention Period: Select the retention type: Forever, No Retention or Period.

    • If you have selected Period, set a Time Unit and Time Duration for the retention period.

    • In the Description field, optionally enter a description for the entity retention policy.

  3. Click Finish to close the wizard.

  4. Press Control+S (or Command+S on macOS) to save the editor.

You can define only one entity retention policy per entity in the model.

Retention policies have no effect until the model is deployed and a purge schedule is configured.