Configure Data Retention

This document describes how to define the retention policies for historical and lineage data.

Introduction to Data Retention

The data hub stores the lineage and history of the certified golden data, that is, the data that led to the current state of the golden data:

  • The built-in lineage traces the whole certification process from the golden data back to the sources. It traces the source data changes that were either pushed to the hub through external loads or made directly in the hub, for example, using steppers.

  • Data historization traces the changes made to the golden and master data.

Preserving the lineage and history is a master data governance requirement and a key regulatory compliance focus. However, keeping this information may also create a large volume of data in the hub storage.

To keep a reasonable volume of information, data location managers schedule data purges.

To make sure lineage and history are preserved according to the data governance and compliance requirements, model designers can apply a data retention policy to a model.

Data Retention Policies

Two types of data retention policies apply to a model:

  • The model data retention policy defines the retention duration for history and lineage data in the hub. This policy applies by default to all entities with no specific policy.

  • Entity data retention policies can be specified to override the model retention policy for specific entities.

For example:

  • The hub is configured to retain no history at all. This is the general (model) policy.

  • Employee history and lineage data are retained for 10 years.

  • Product history and lineage data are retained forever.
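The override rule in this example can be sketched in a few lines of Python. This is only an illustration of the lookup logic, not a product API: the `Retention` class, the `effective_policy` function, and the `Customer` entity are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Retention:
    """Hypothetical stand-in for a retention setting (not a product API)."""
    kind: str                    # "FOREVER", "NO_RETENTION", or "PERIOD"
    years: Optional[int] = None  # only meaningful when kind == "PERIOD"


# Model (default) policy: retain no history at all.
MODEL_POLICY = Retention("NO_RETENTION")

# Entity policies override the model policy for specific entities.
ENTITY_POLICIES: Dict[str, Retention] = {
    "Employee": Retention("PERIOD", years=10),
    "Product": Retention("FOREVER"),
}


def effective_policy(entity: str) -> Retention:
    """Return the entity policy if one exists, otherwise the model policy."""
    return ENTITY_POLICIES.get(entity, MODEL_POLICY)


if __name__ == "__main__":
    for name in ("Employee", "Product", "Customer"):
        print(name, effective_policy(name))
```

An entity without its own policy, such as the hypothetical `Customer` entity, falls back to the model policy.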

Workflow Metadata Retention Policy

When running a workflow, metadata from the workflow instances and their related stepper and branch instances, obsolete work items, attachments, and datasets may be stored in the database schema.

The retention and purge of workflow metadata are distinct considerations that should be addressed by workflow designers. For this reason, the retention policy of a specific workflow must be configured as a workflow definition property within the Workflow Builder.

Data Purge

Data purges run in the deployed hub according to the retention policy defined for the model.

The purge deletes the following elements of the lineage and history:

  • Source data published to the hub via external loads.

  • Data authored (created, modified or overridden) in the hub.

  • Traces of deleted data.

  • Golden and master data history (if historization is configured).

  • Errors detected on the source and authored data by the integration job.

  • Duplicate choices made by users in duplicate managers.
    Note that duplicate-management decisions still apply after a purge, but information about the time of the decision and the decision maker is deleted.

Purges affect only the history and lineage of the data in the data location. They do not delete the actual golden and master data.

Optionally, the following repository artifacts can also be deleted as part of the purge process:

  • Job logs, batches and loads for which all the processed data has been purged.

  • Direct authoring, duplicate manager and workflow instances for which all the changed data has been purged.

Job logs, batches, loads, direct authoring, duplicate manager and workflow instances are purged only when all their data have been purged. Therefore, they are purged according to the longest retention policy among the entities that they manage.
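The "longest retention policy" rule can be illustrated with a short Python sketch. It assumes, for illustration only, that each retention is expressed as a number of days, with `math.inf` standing for Forever and `0` for No Retention. The `RETENTION_DAYS` table, the `Customer` entity, and the `artifact_retention_days` function are hypothetical.

```python
import math
from typing import Dict, Iterable

# Hypothetical retention durations in days (10 years is approximated as
# 3650 days); math.inf stands for "Forever" and 0 for "No Retention".
RETENTION_DAYS: Dict[str, float] = {
    "Employee": 3650,
    "Product": math.inf,
    "Customer": 0,
}


def artifact_retention_days(entities: Iterable[str]) -> float:
    """An artifact is kept as long as the longest-retained entity it manages."""
    return max(RETENTION_DAYS[entity] for entity in entities)


if __name__ == "__main__":
    # A load that touched Employee and Customer data follows the Employee policy.
    print(artifact_retention_days(["Employee", "Customer"]))
    # A load that touched Product data is never purged.
    print(artifact_retention_days(["Employee", "Product"]))
```

In this sketch, a single entity retained forever is enough to keep the artifact indefinitely.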

Deploy Purge Jobs

When a model is deployed, a purge job is automatically created in the deployment data location. This job purges data and artifacts according to the retention policy defined in the deployed model.

Purge jobs are scheduled by the data location manager as part of the data location configuration. For more information, see Configure Data Purge.

Regardless of the frequency of the purge schedule, the data history is retained as defined by the model designer in the data retention policies.
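The following Python sketch illustrates why the schedule frequency does not shorten retention. It assumes, as a simplification not stated by the product documentation, that a record is removed by the first purge run that occurs on or after the end of its retention period. The `first_purging_run` function and its parameters are hypothetical.

```python
from datetime import datetime, timedelta


def first_purging_run(changed_at: datetime, retention: timedelta,
                      first_run: datetime, interval: timedelta) -> datetime:
    """Return the first scheduled run that may remove the record.

    Assumption: a run removes a record only once its retention has expired.
    """
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    expires_at = changed_at + retention
    run = first_run
    while run < expires_at:  # earlier runs leave the record in place
        run += interval
    return run


if __name__ == "__main__":
    changed = datetime(2024, 1, 1)
    retention = timedelta(days=30)
    start = datetime(2024, 1, 1, 2)
    print(first_purging_run(changed, retention, start, timedelta(days=1)))
    print(first_purging_run(changed, retention, start, timedelta(weeks=1)))
```

Under this assumption, a more frequent schedule removes expired records sooner after they expire, but never before the retention period ends.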

Define a Default Retention Policy

The model retention policy applies to all entities that are not subject to a specific retention policy.

To define the default data retention policy:

  1. In the Application Builder, open the model edition for which you wish to define a retention policy.

  2. In the Model Design view, double-click the Retention Policies node. The Data Retention Policy editor opens.

  3. In the Data Retention Policy editor, in the Data Retention Policy section, set the properties for each of the following types of data:

    • Source Data

    • Source Errors

    • History

    • Deletions

  4. (Optional) In the Description field, enter a description for the retention policy.

  5. Press Control+S (Command+S on macOS) to save your changes.

Define an Entity Data Retention Policy

The default retention policy defined in the previous section applies to all entities. You can also define entity-specific retention policies to override the default retention policy.

To define an entity data retention policy:

  1. In the Data Retention Policy editor, click Add Entity Retention Policy. The Create New Entity Retention Policy wizard opens.

  2. In the Create New Entity Retention Policy wizard, in the Entity field, select the entity for which you want to define a retention policy.

  3. Set the properties for each of the following types of data:

    • Source Data

    • Source Errors

    • History

    • Deletions

  4. Click Finish to close the wizard.

  5. Press Control+S (Command+S on macOS) to save your changes.

You can only have one entity retention policy per entity of the model.

The retention policy has no effect unless the model is deployed and a purge schedule is configured.

Data Retention Properties

The following table lists the properties used for defining the retention policy for different types of data.

Table 1. Data retention properties
Property
    Description

<DataType> Retention Type
    Defines how long the data should be retained. Possible values are:

      • Forever: the data are never deleted.

      • No Retention: the data are not retained.

      • Period: the data are retained for a specified duration.

<DataType> Time Duration
    Only editable if the retention type is set to Period.

    Number representing the duration for which the data should be retained.

<DataType> Time Unit
    Only editable if the retention type is set to Period.

    Unit of the duration. Possible values are:

      • Days

      • Hours

      • Months

      • Weeks

      • Years
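To show how the three properties combine, the following Python sketch computes a purge cutoff, that is, the instant before which data may be purged. It is an illustration under stated assumptions, not the product's implementation: the `purge_cutoff` function is hypothetical, and month and year durations are assumed to use calendar arithmetic with the day clamped to the length of the target month.

```python
import calendar
from datetime import datetime, timedelta
from typing import Optional


def _minus_months(moment: datetime, months: int) -> datetime:
    """Subtract calendar months, clamping the day to the target month's length."""
    index = moment.year * 12 + (moment.month - 1) - months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    day = min(moment.day, calendar.monthrange(year, month)[1])
    return moment.replace(year=year, month=month, day=day)


def purge_cutoff(retention_type: str, now: datetime,
                 duration: int = 0, unit: str = "Days") -> Optional[datetime]:
    """Return the instant before which data may be purged (None for Forever)."""
    if retention_type == "Forever":
        return None  # nothing is ever old enough to purge
    if retention_type == "No Retention":
        return now   # everything recorded before now may be purged
    if retention_type != "Period":
        raise ValueError(f"unknown retention type: {retention_type}")
    if unit == "Hours":
        return now - timedelta(hours=duration)
    if unit == "Days":
        return now - timedelta(days=duration)
    if unit == "Weeks":
        return now - timedelta(weeks=duration)
    if unit == "Months":
        return _minus_months(now, duration)
    if unit == "Years":
        return _minus_months(now, 12 * duration)
    raise ValueError(f"unknown time unit: {unit}")


if __name__ == "__main__":
    now = datetime(2024, 3, 31, 12, 0)
    print(purge_cutoff("Period", now, 10, "Years"))
    print(purge_cutoff("Period", now, 1, "Months"))
    print(purge_cutoff("Forever", now))
```

The duration and unit arguments are ignored unless the retention type is Period, which mirrors the editability rule in the table.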