Configure data retention

This page describes how to define the retention policies for historical and lineage data.

Introduction to data retention

The data hub stores the lineage and history of the certified golden data, that is, the data that led to the current state of the golden data:

  • The built-in Lineage traces the whole certification process from the golden data back to the sources. It also traces source data changes, whether they are pushed to the hub through external loads or made directly in the hub, for example using steppers.

  • Data Historization traces the changes made to the golden and master data.

Preserving the lineage and history is a master data governance requirement and is key to regulatory compliance. However, keeping this information can also create a large volume of data in the hub storage.

To keep a reasonable volume of information, data location managers schedule purges for this data.

To make sure lineage and history are preserved according to the data governance and compliance requirements, model designers define a Data Retention Policy for the model.

Data retention policies

Data Retention Policies are defined with:

  • A Model Retention Policy, defining the retention duration for history and lineage data in the hub. This policy applies by default to all entities with no specific policy.

  • Entity Retention Policies that override the model retention policy for specific entities.

For example:

  • The hub is configured to retain no history at all. This is the general policy.

  • Employee data is retained for 10 years.

  • Product data is retained forever.
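
The override rule in this example can be sketched in a few lines of Python. This is only an illustration of the lookup logic, not a product API: the `Retention` class, the `MODEL_POLICY` and `ENTITY_POLICIES` names, and the `Customer` entity are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Retention:
    """Hypothetical stand-in for one retention setting."""
    kind: str                      # "FOREVER", "NO_RETENTION" or "PERIOD"
    duration: Optional[int] = None  # only meaningful when kind == "PERIOD"
    unit: Optional[str] = None      # only meaningful when kind == "PERIOD"


# Model retention policy: the general policy, no history retained.
MODEL_POLICY = Retention("NO_RETENTION")

# Entity retention policies: at most one per entity.
ENTITY_POLICIES = {
    "Employee": Retention("PERIOD", 10, "YEARS"),
    "Product": Retention("FOREVER"),
}


def effective_policy(entity: str) -> Retention:
    """Return the entity policy if one exists, else the model policy."""
    return ENTITY_POLICIES.get(entity, MODEL_POLICY)
```

With these definitions, `effective_policy("Employee")` returns the 10-year policy, while an entity with no specific policy, such as the hypothetical `Customer`, falls back to the model policy.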

Data purge

Depending on the retention policy defined for the model, data purge takes place in the deployed hub.

The purge deletes the following elements of the lineage and history:

  • Source data published to the hub via external loads

  • Data authored (created, modified or overridden) in the hub

  • Traces of deleted data

  • Golden and master data history (if historization is configured)

  • Errors detected on the source and authored data by the integration job

  • Duplicate choices made by users in duplicate managers
    Note that the duplicate management decisions still apply after a purge, but information about the time of the decision and the decision maker is deleted.

The purges only impact the history and lineage of the data in the data location. They do not delete actual golden and master data.

Optionally, the following repository artifacts can also be deleted as part of the purge process:

  • Job logs, batches and loads for which all the processed data has been purged.

  • Direct authoring, duplicate manager and workflow instances for which all the changed data has been purged.

Job logs, batches, loads, direct authoring, duplicate manager and workflow instances are purged only when all their data has been purged. Therefore, they are purged based on the longest retention policy among the entities that they manage.
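
As a rough illustration of the "longest retention policy" rule, the following Python sketch computes the retention that applies to an artifact touching several entities. It is a simplification under stated assumptions: retention is expressed in days, `None` stands for Forever, and the entity names, the `RETENTION_DAYS` mapping and both functions are hypothetical. The actual purge job checks whether the underlying data has been purged rather than comparing dates.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional

# Hypothetical retention per entity, in days. None means "Forever".
RETENTION_DAYS = {"Employee": 3650, "Contact": 30, "Product": None}


def artifact_retention_days(entities: Iterable[str]) -> Optional[int]:
    """Longest retention among the entities; None if any is kept forever."""
    values = [RETENTION_DAYS[name] for name in entities]
    if any(value is None for value in values):
        return None
    return max(values)


def artifact_is_purgeable(entities: Iterable[str],
                          completed_at: datetime,
                          now: datetime) -> bool:
    """True once the longest retention among the entities has elapsed."""
    days = artifact_retention_days(entities)
    if days is None:
        return False  # some data is retained forever, so the artifact stays
    return completed_at + timedelta(days=days) <= now
```

In this sketch, a batch that loaded both `Employee` and `Contact` records is kept for 3650 days, even though the `Contact` data it loaded expires after 30 days.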

Deploy purge jobs

When a model is deployed, a purge job is automatically created in the deployment data location. This job purges data and artifacts according to the retention policy defined in the deployed model.

Purge jobs are scheduled by the data location manager as part of the data location configuration. Refer to Configure data purge for more information.

Regardless of the frequency of the purge schedule, the data history is retained as defined by the model designer in the data retention policies.
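
The independence between purge frequency and retention can be illustrated with a small Python simulation. It assumes that a purge run deletes entries older than the run time minus the retention period; the `run_purge` function, the sample dates and the boundary handling are assumptions made for the example, not the product's implementation.

```python
from datetime import datetime, timedelta


def run_purge(history, retention, run_time):
    """Return the entries that survive a purge executed at run_time."""
    cutoff = run_time - retention
    # Assumption: entries dated at or after the cutoff are kept.
    return [stamp for stamp in history if stamp >= cutoff]


retention = timedelta(days=30)
history = [datetime(2024, 1, 1), datetime(2024, 1, 20), datetime(2024, 2, 10)]

# A single purge on February 15, as an infrequent schedule might run.
single_run = run_purge(history, retention, datetime(2024, 2, 15))

# Daily purges from February 1 through February 15.
daily_runs = history
for day in range(1, 16):
    daily_runs = run_purge(daily_runs, retention, datetime(2024, 2, day))
```

Both schedules leave the same entries in place on February 15. The schedule only determines how soon expired entries are physically removed, never how long unexpired entries are kept.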

Define a default retention policy

The model retention policy applies to all the entities with no specific retention policy.

To define the default data retention policy:

  1. In Application Builder, open the model edition for which you wish to define a retention policy.

  2. In the Model Design view, double-click the Retention Policies node. The Data Retention Policy editor opens.

  3. In the Data Retention Policy editor, in the Data Retention Policy section, set the properties for each of the following types of data:

    • Source Data

    • Source Errors

    • History

    • Deletions

  4. (Optional) In the Description field, enter a description for the retention policy.

  5. Press Control+S (or Command+S on macOS) to save your changes.

Define an Entity Retention Policy

The default retention policy defined above applies to all entities that have no specific retention policy. You can define entity-specific retention policies to override the default retention policy.

To define an entity retention policy:

  1. In the Data Retention Policy editor, click Add Entity Retention Policy. The Create New Entity Retention Policy wizard opens.

  2. In the Create New Entity Retention Policy wizard, in the Entity field, select the entity for which you want to define a retention policy.

  3. Set the properties for each of the following types of data:

    • Source Data

    • Source Errors

    • History

    • Deletions

  4. Click Finish to close the wizard.

  5. Press Control+S (or Command+S on macOS) to save your changes.

You can only have one entity retention policy per entity of the model.
The retention policy has no effect unless the model is deployed and a purge schedule is configured.

Data retention properties

The following table lists the properties used for defining the retention policy for different types of data.

Table 1. Data retention properties
Property Description

<DataType> Retention Type

Defines how long the data should be retained. Possible values are:

  • Forever: The data is never deleted.

  • No Retention: The data is not retained. It is eligible for deletion as soon as a purge runs.

  • Period: The data is retained for the specified duration.

<DataType> Time Duration

Only editable if the retention type is set to Period.

Number representing the duration for which the data should be retained.

<DataType> Time Unit

Only editable if the retention type is set to Period.

Unit of the duration. Possible values are:

  • Days

  • Hours

  • Months

  • Weeks

  • Years
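
To show how the three properties combine, the following Python sketch turns a retention type, time duration and time unit into a cutoff date before which data would be eligible for purge. The function names, the uppercase string constants and the calendar-based handling of months and years (clamping to the last day of a shorter month) are assumptions made for illustration. The product may compute durations differently.

```python
import calendar
from datetime import datetime, timedelta
from typing import Optional


def subtract_months(moment: datetime, months: int) -> datetime:
    """Go back a number of calendar months, clamping the day if needed."""
    index = moment.year * 12 + (moment.month - 1) - months
    year, month_index = divmod(index, 12)
    month = month_index + 1
    last_day = calendar.monthrange(year, month)[1]
    return moment.replace(year=year, month=month, day=min(moment.day, last_day))


def retention_cutoff(retention_type: str,
                     duration: Optional[int],
                     unit: Optional[str],
                     now: datetime) -> Optional[datetime]:
    """Data older than the returned date is eligible for purge.

    Returns None for Forever, meaning nothing is ever eligible.
    """
    if retention_type == "FOREVER":
        return None
    if retention_type == "NO_RETENTION":
        return now  # everything recorded so far is eligible
    if retention_type != "PERIOD":
        raise ValueError(f"Unknown retention type: {retention_type}")
    if duration is None or unit is None:
        raise ValueError("Period retention needs a duration and a unit")
    if unit == "HOURS":
        return now - timedelta(hours=duration)
    if unit == "DAYS":
        return now - timedelta(days=duration)
    if unit == "WEEKS":
        return now - timedelta(weeks=duration)
    if unit == "MONTHS":
        return subtract_months(now, duration)
    if unit == "YEARS":
        return subtract_months(now, 12 * duration)
    raise ValueError(f"Unknown time unit: {unit}")
```

For example, `retention_cutoff("PERIOD", 10, "YEARS", datetime(2024, 3, 31))` gives March 31, 2014, and a one-month period evaluated on the same date gives February 29, 2024, because the day is clamped to the end of the shorter month.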