Configure data purge

Data purging helps maintain a reasonable storage volume for the data location and the repository by pruning the history of data changes and job logs.

Introduction to data purge

The data location stores the lineage and history of the certified golden data, that is, the data that led to the current state of the golden data.

Preserving the lineage and history is a master data governance requirement and is key to regulatory compliance. However, keeping this information can also create a large volume of data in the hub storage.

To make sure lineage and history are preserved according to the data governance and compliance requirements, model designers should define a data retention policy for the model.

When a model is deployed to a data location, a purge job is automatically created. This job prunes the lineage and history data according to the retention policy. Optionally, it also prunes the repository artifacts (job logs, batches, loads, and direct authoring, duplicate manager, and workflow instances) once all their data has been purged.
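The idea of pruning history according to a retention policy can be illustrated with a minimal sketch. This is not the product's implementation: the `HistoryEntry` class, the `select_purgeable` function, and the one-year retention period are all hypothetical names and values chosen for illustration. It only shows the general principle that history older than the retention window becomes eligible for purge while the current golden state is kept.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class HistoryEntry:
    """Hypothetical stand-in for one historized version of a record."""
    record_id: str
    changed_at: datetime
    is_current: bool  # True for the version that is the current golden state


def select_purgeable(entries, retention, now):
    """Return entries older than the retention window, never the current state."""
    cutoff = now - retention
    return [e for e in entries if not e.is_current and e.changed_at < cutoff]


# Illustrative data: three versions of the same record.
now = datetime(2024, 6, 1)
entries = [
    HistoryEntry("C1", datetime(2022, 1, 10), False),  # older than one year
    HistoryEntry("C1", datetime(2024, 3, 5), False),   # within the window
    HistoryEntry("C1", datetime(2024, 5, 20), True),   # current golden state
]

# Assumed retention policy of one year (hypothetical value).
purgeable = select_purgeable(entries, timedelta(days=365), now)
```

In this sketch only the 2022 version is selected. Running the selection more or less often does not change which versions fall outside the window at a given date.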

To keep a reasonable volume of information, data location managers must schedule regular executions of this job.

Configure a purge schedule

To create a purge schedule:

  1. In the Management view, expand the Data Locations node.

  2. Expand the data location for which you want to configure a purge.

  3. Double-click the Purge node. The Purge Schedule editor opens.

  4. Select or clear the Active checkbox to make the purge schedule active or inactive.

  5. Click the edit expression button, and set the purge schedule either by choosing a purge frequency (monthly, weekly, or daily) or by entering a cron expression.

  6. Click OK to save the schedule.

  7. Select the Purge Repository Artifacts option to prune the job logs, batches, loads, direct authoring, duplicate manager, and workflow instances when all their data is purged.

  8. Press Control+S (Command+S on macOS) to save the editor.
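For the cron option in step 5, the following sketch shows what daily, weekly, and monthly schedules might look like. It assumes the editor accepts Quartz-style cron expressions (six fields, starting with seconds, plus an optional year), which this page does not state, so check the product reference before relying on it. The `split_cron` helper and the 2:00 AM run time are hypothetical and serve only to make the field layout explicit.

```python
# Field names follow the Quartz cron layout. Whether the purge editor
# uses this layout is an assumption, not something this page confirms.
QUARTZ_FIELDS = (
    "seconds", "minutes", "hours", "day_of_month", "month", "day_of_week",
)

# Illustrative schedules, all running at 2:00 AM (hypothetical choice).
EXAMPLE_SCHEDULES = {
    "daily": "0 0 2 * * ?",      # every day
    "weekly": "0 0 2 ? * SUN",   # every Sunday
    "monthly": "0 0 2 1 * ?",    # first day of every month
}


def split_cron(expression):
    """Split a Quartz-style expression into named fields.

    Only checks the number of fields; it does not validate their values.
    """
    parts = expression.split()
    if len(parts) not in (6, 7):
        raise ValueError(f"expected 6 or 7 fields, got {len(parts)}")
    fields = dict(zip(QUARTZ_FIELDS, parts))
    if len(parts) == 7:
        fields["year"] = parts[6]  # optional trailing year field
    return fields


for name, expression in EXAMPLE_SCHEDULES.items():
    print(name, split_cron(expression))
```

A five-field Unix-style expression such as `0 2 * * *` is rejected by this helper, which is a common source of confusion when moving between cron dialects.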

The purge schedule only controls how often pruning runs. Regardless of the frequency set by the data location manager, the amount of history retained is the one defined by the model designer in the data retention policies.