Integrate with Purview

This document explains how to integrate Semarchy xDM with Microsoft Purview.

Purview is a data governance and data catalog service available in Microsoft Azure that stores and manages both physical and logical metadata assets.

Semarchy xDM comes with a Purview Connector to synchronize metadata from Semarchy data hubs into Purview, while linking the logical model assets (entities and attributes) to the corresponding physical assets (tables and columns), and enabling end-to-end data lineage.

Overview

The Purview Connector converts Semarchy metadata into Purview assets in the following way:

  • Semarchy xDM instance, data location, entities, attributes, and relationships are converted into Purview entities, using Semarchy-specific asset types.

  • Entities and attributes are related to the corresponding physical tables (GD_, MD_, etc.) and columns, previously scanned using the built-in Purview scanners for Microsoft SQL Server, PostgreSQL, or Oracle. These tables and columns are enriched with information from Semarchy.

  • A process is created to relate the physical tables and represent the certification process for each entity.

Configure Purview

Configure a Collection

The Purview Connector creates the Semarchy xDM assets in a collection. It is recommended to create a dedicated collection for the Semarchy xDM assets, and note the Name of this collection.

Configure the REST API

The Purview Connector creates and updates assets in Purview using the REST API.

To enable this connectivity, follow the instructions in the Purview documentation to configure and use the REST API.

While configuring the REST API, make sure to collect the following information:

  • The Azure Tenant ID: search for "Tenant Properties" in the Azure portal. The ID is shown on the tenant properties page.

  • The Purview Account name. For example, SemarchyDemoPurview.

  • The Azure Application Client ID and Azure Application Client Secret of the application registered in Azure Active Directory and assigned to a data plane role (Data Curator) for the Microsoft Purview account.
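The values collected above can be checked before deploying the connector by requesting an access token with them. The following is a minimal sketch, not the connector's own code: it assumes the Azure AD client-credentials flow with the token endpoint and the https://purview.azure.net resource described in the Purview REST API tutorial, and the function names purview_token_url and purview_request_token are hypothetical.

```shell
#!/bin/sh
# Sketch only: function names are illustrative, and the endpoint and resource
# are assumptions taken from the Purview REST API tutorial.

# Build the Azure AD token endpoint for a tenant. Argument: tenant ID.
purview_token_url() {
  printf 'https://login.microsoftonline.com/%s/oauth2/token\n' "$1"
}

# Request a token with the client-credentials flow.
# Arguments: tenant ID, client ID, client secret.
# Prints the raw JSON response, which carries the token in "access_token".
purview_request_token() {
  curl --silent --show-error --fail \
    --request POST "$(purview_token_url "$1")" \
    --data-urlencode 'grant_type=client_credentials' \
    --data-urlencode "client_id=$2" \
    --data-urlencode "client_secret=$3" \
    --data-urlencode 'resource=https://purview.azure.net'
}
```

If purview_request_token returns a JSON document containing an access_token field, the tenant ID, client ID, and client secret are consistent with each other.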

Configure Semarchy xDM

The Purview Connector extracts metadata from Semarchy xDM using a REST API, and authenticates using an API Key.

In Semarchy xDM, create an API Key with the Application Management and Repository Information privileges.

Make sure that you have the Semarchy instance URL and the API key at hand to connect to your Semarchy instance.

Scan the Data Location Schema

When running, the Purview Connector searches for the physical assets (tables, columns) of the data locations in the Purview catalog, to relate them to the logical assets.

Before running the Purview Connector, you must scan the xDM data location schemas using the Purview built-in scanners and harvest the physical assets' metadata (table and column definitions).

To scan a data location schema:

  1. In Purview, register a new data source pointing to the data location schema.

  2. Scan this data source.
    After the scan, the tables and columns are visible in Purview.

  3. Search for a GD_ table corresponding to your data location to confirm that the scan was successful.

  4. Navigate to the schema (the container) hosting the tables, and note the Qualified Name of that schema. For example:
    postgresql://servers/176.159.263.21:15432/dbs/postgres/schemas/semarchy_product_retail_mdm

Repeat the previous steps for each data location deployed for the Semarchy xDM instance.

Each data source technology has specific configuration steps that are detailed in the Purview documentation. For example, certain sources require that you store the database password in Azure Key Vault.
Review the Purview documentation page for the technology hosting your data location.

Deploy the Purview Connector

The Purview Connector retrieves metadata from deployed model editions in an xDM instance, creates logical assets in the Purview collection that you created, and relates them to the physical assets that you scanned.

The Purview Connector is available as a Docker image, which can be deployed and executed as an Azure function.

Prepare the Configuration File

You configure the Azure Function at deployment time using a JSON configuration file.

To prepare the configuration file:

  1. Download the sample Azure Function App configuration file (create-azure-function-app-settings.json).

  2. Edit this file and set the configuration properties, listed below.

Table 1. Configuration Properties
Property Value

xdmPurviewAccount

The Purview Account name you retrieved when configuring the Purview REST API. For example: SemarchyDemoPurview

xdmPurviewTenantId

The Azure Tenant ID you retrieved when configuring the Purview REST API.

xdmPurviewClientId

The Azure Client ID created when configuring the Purview REST API.

xdmPurviewClientSecret

The Azure Client Secret created when configuring the Purview REST API.

xdmDataLocationPurviewQualifiedName_<data-location-name>

The qualified name of the Purview entity representing the container (database schema) hosting the tables of the data location named <data-location-name>. You retrieved this qualified name after scanning the data location schema.

Create one property for each Semarchy xDM data location to synchronize.

For example, to synchronize two data locations named CustomerB2CDemo and ProductRetailDemo, set the qualified names for their schemas as shown below:

{
  ...
  "xdmDataLocationPurviewQualifiedName_CustomerB2CDemo": "postgresql://servers/176.159.263.21:15432/dbs/postgres/schemas/semarchy_customer_b2c_mdm",
  "xdmDataLocationPurviewQualifiedName_ProductRetailDemo": "postgresql://servers/176.159.263.21:15432/dbs/postgres/schemas/semarchy_product_retail_mdm",
  ...
}

xdmPurviewCollectionName

The name of the collection you configured for the xDM assets. For example: xDM Assets

xdmPurviewSkipTypes

The Purview Connector creates Semarchy-specific asset types before creating the Semarchy assets. Set this property to true to skip the creation of asset types, for example, if they already exist in Purview and do not require an update.

xdmPurviewSkipPhysicalAssetsUpdate

The Purview Connector updates the tables and columns with descriptions based on Semarchy metadata. Set this property to true to skip this update.

xdmPurviewDryRun

Set this property to true to perform a dry run, without creating or updating anything in Purview.

xdmPurviewConnectorSchedule

Cron schedule for the Purview Connector execution. For example, 0 * * * * *

xdmInstanceUrl

The Semarchy instance URL, typically http://<host>:<port>/semarchy.

xdmInstanceApiKey

The API key used to connect to the Semarchy instance.

scheduledXdmPurviewConnectorDisabled

Set to true to disable the scheduled execution.

Set this option to true before running manual operations on the Purview Connector.
Example 1. Sample create-azure-function-app-settings.json configuration file.
{
  "xdmPurviewAccount": "SemarchyDemoPurview",
  "xdmPurviewTenantId": "758077ec-66b9-441c-9537-b0939cb3dfe8",
  "xdmPurviewClientId": "xxxxxxxxx",
  "xdmPurviewClientSecret": "xxxxxxxxx",
  "xdmDataLocationPurviewQualifiedName_CustomerB2CDemo": "postgresql://servers/176.159.263.21:15432/dbs/postgres/schemas/semarchy_customer_b2c_mdm",
  "xdmDataLocationPurviewQualifiedName_ProductRetailDemo": "postgresql://servers/176.159.263.21:15432/dbs/postgres/schemas/semarchy_product_retail_mdm",
  "xdmPurviewCollectionName": "xDM Assets",
  "xdmPurviewSkipTypes": "true",
  "xdmPurviewSkipPhysicalAssetsUpdate": "true",
  "xdmPurviewDryRun": "true",
  "xdmPurviewConnectorSchedule": "0 * * * * *",
  "xdmInstanceUrl": "http://176.159.263.21:10081/semarchy",
  "xdmInstanceApiKey": "xxxxxxxxx",
  "scheduledXdmPurviewConnectorDisabled": "true",
  "FUNCTIONS_WORKER_RUNTIME": "java"
}
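Before deploying, it can help to confirm that the settings file defines the properties the connector relies on. The sketch below is illustrative only: the function name check_connector_settings is hypothetical, and treating these particular properties as mandatory is an assumption, since the table above does not mark any property as required. It only checks that each key is present, not that its value is valid.

```shell
#!/bin/sh
# Sketch only: the list of "required" keys is an assumption.
# Prints each missing property and returns non-zero if any are absent.
check_connector_settings() {
  file=$1
  missing=0
  for key in xdmPurviewAccount xdmPurviewTenantId xdmPurviewClientId \
             xdmPurviewClientSecret xdmPurviewCollectionName \
             xdmInstanceUrl xdmInstanceApiKey; do
    if ! grep -q "\"$key\"[[:space:]]*:" "$file"; then
      echo "missing property: $key"
      missing=1
    fi
  done
  # At least one data location must be mapped to a Purview qualified name.
  if ! grep -q '"xdmDataLocationPurviewQualifiedName_[^"]*"[[:space:]]*:' "$file"; then
    echo "missing property: xdmDataLocationPurviewQualifiedName_<data-location-name>"
    missing=1
  fi
  return "$missing"
}
```

For example, check_connector_settings create-azure-function-app-settings.json prints nothing and returns zero when all of these properties are present.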

Deploy the Azure Function

  1. Download the Azure function creation script (create-azure-function-app.sh).

  2. Review and amend the script depending on your Azure and local environments.

  3. Run the script to create all the resources required to run the connector.

This script creates or updates the following resources:

  • A Resource Group.

  • A Storage Account within the selected Resource Group.

  • A Premium Plan for the Function App (see below).

  • An Azure Function App that contains the Azure functions used to launch the operations available on the connector.

This script expects the create-azure-function-app-settings.json file to be located in the same folder, and uses it to configure the deployed Azure Function.

Example 2. Azure function creation syntax
./create-azure-function-app.sh \
  --name <azure-function-name> \
  --storage-account <storage-account-name> \
  --resource-group <resource-group-name> \
  --docker-image-tag <docker-image-tag>

Main script parameters:

  • --name (required): Name of the Azure Function App to deploy. This name must be unique, because it is used for the endpoint URL: https://<name>.azurewebsites.net/api/.

  • --storage-account (required): Name of the Azure Storage Account to create or update. It must be unique, and contain 3 to 24 characters, using numbers and lowercase letters only.

  • --resource-group (required): Name of Azure Resource Group into which the function is created. If it does not exist, the script creates the group, provided that the account has sufficient privileges.

  • --docker-image-tag: Docker image tag to use for the connector (defaults to latest). The tag must correspond to the version of the Semarchy instance from which the connector will harvest metadata.

To see all available parameters, run ./create-azure-function-app.sh -h.
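The naming rule for --storage-account is easy to get wrong, so it can be checked locally before running the script. The following sketch encodes only the rule stated above (3 to 24 characters, numbers and lowercase letters). The function name is hypothetical, and uniqueness cannot be verified locally.

```shell
#!/bin/sh
# Sketch only: validates the documented format, not the name's availability.
is_valid_storage_account_name() {
  name=$1
  # Reject empty names and any character other than a lowercase letter or a
  # digit. The set is spelled out to avoid locale-dependent ranges.
  case $name in
    ''|*[!abcdefghijklmnopqrstuvwxyz0123456789]*) return 1 ;;
  esac
  # Enforce the 3 to 24 character length.
  [ "${#name}" -ge 3 ] && [ "${#name}" -le 24 ]
}
```

For example, is_valid_storage_account_name xdmstorageaccount succeeds, while a name such as xdm-storage fails because of the hyphen.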

Example 3. Azure function creation sample
./create-azure-function-app.sh \
  --name xdm-purview-sync \
  --storage-account xdmstorageaccount \
  --resource-group xdm-group \
  --docker-image-tag 2023.2.0
For macOS Users

The script uses the getopt command that is not available by default in macOS.
To install this tool:

  1. Install util-linux using Homebrew:

    brew install util-linux
  2. Add getopt to your path using the command corresponding to your shell:

    # bash
    echo 'export PATH="/usr/local/opt/gnu-getopt/bin:$PATH"' >> ~/.bash_profile
    # zsh
    echo 'export PATH="/usr/local/opt/gnu-getopt/bin:$PATH"' >> ~/.zshrc
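After updating the path and opening a new terminal, you can confirm that the shell now resolves the enhanced getopt. This sketch relies on the documented behavior of the util-linux getopt, which exits with status 4 when called with --test. The function name has_enhanced_getopt is hypothetical.

```shell
#!/bin/sh
# Sketch only: returns zero when the given command (default: getopt) is the
# enhanced util-linux implementation, which exits with status 4 on --test.
has_enhanced_getopt() {
  status=0
  "${1:-getopt}" --test >/dev/null 2>&1 || status=$?
  [ "$status" -eq 4 ]
}
```

For example, `has_enhanced_getopt || echo "enhanced getopt not found on PATH"` reports when the default macOS getopt is still the one being picked up.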

Run the Purview Connector Manually

Once configured with a Cron schedule, the Azure Function runs the Purview Connector automatically.

You can also run operations using the REST endpoints exposed by the function.

To use the Azure Function endpoints:

  1. Retrieve the endpoint URL from the Azure Function.

    The code query parameter may be passed as a Request Header named x-functions-key.
  2. Use a REST client to perform the request.

  3. Review the Semarchy assets created in the collection.

To avoid conflicts between manual and scheduled executions of the Purview Connector, change the Azure Function configuration and set the scheduledXdmPurviewConnectorDisabled property to true before running manual operations.
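As an alternative to a graphical REST client, the request can be sent with curl. The sketch below is a minimal, hypothetical wrapper: the function name call_connector and the CURL override are illustrative, the endpoint URL must be the one you retrieved from the Azure Function, and sending the body with a JSON content type is an assumption. The function key is passed in the x-functions-key header rather than in the code query parameter.

```shell
#!/bin/sh
# Sketch only. Arguments: HTTP method, endpoint URL, function key, JSON body.
# Set CURL=echo to print the curl arguments instead of sending the request.
call_connector() {
  method=$1 url=$2 key=$3 body=$4
  "${CURL:-curl}" --silent --show-error --fail \
    --request "$method" "$url" \
    --header "x-functions-key: $key" \
    --header 'Content-Type: application/json' \
    --data "$body"
}
```

For example, `call_connector POST "<endpoint-url>" "<function-key>" '{"dryRun": true}'` requests a dry run, where both placeholders stand for values copied from the Azure portal.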

POST Operation

The POST operation runs the Purview Connector manually. It accepts the properties listed below in JSON format in the request body.

Table 2. POST operation parameters
Name Description

dryRun

Set this property to true to perform a dry run, without creating or updating anything in Purview. This parameter defaults to the corresponding property set in the function application configuration.

skipTypes

Set this property to true to skip asset-type creation or update. This parameter defaults to the corresponding property set in the function application configuration.

skipPhysicalAssetsUpdate

The Purview Connector updates the tables and columns with descriptions based on Semarchy metadata. Set this property to true to skip this update. This parameter defaults to the corresponding property set in the function application configuration.

dataLocationNames

Array of data location names to synchronize. Use this option to selectively synchronize a subset of the data locations. This list defaults to all Semarchy data locations in the Ready status and with a qualified name defined in the application configuration.

A data location not found in Semarchy or with no qualified name correspondence in the application settings will be ignored.

purviewCollectionName

The name of the collection you configured to receive the Semarchy assets. This parameter defaults to the corresponding property set in the function application configuration.

Example 4. POST operation sample request body
{
    "dryRun": true,
    "skipTypes": false,
    "skipPhysicalAssetsUpdate": false,
    "dataLocationNames": ["CustomerB2CDemo"],
    "purviewCollectionName": "xDM Assets"
}
The process can take a relatively long time (5 to 10 minutes), depending on the options selected and on the size and number of data locations.
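When synchronizing only some data locations, the request body can be generated rather than edited by hand. The following sketch builds a body containing the dryRun and dataLocationNames parameters described above. The function name connector_post_body is hypothetical, and the sketch assumes that data location names contain no double quotes or backslashes, since it performs no JSON escaping.

```shell
#!/bin/sh
# Sketch only. Arguments: dryRun value (true or false), followed by zero or
# more data location names. Omitted parameters keep their configured defaults.
connector_post_body() {
  dry_run=$1
  shift
  names=
  for n in "$@"; do
    # Append each name as a quoted, comma-separated JSON string.
    names="${names:+$names, }\"$n\""
  done
  if [ -n "$names" ]; then
    printf '{"dryRun": %s, "dataLocationNames": [%s]}\n' "$dry_run" "$names"
  else
    printf '{"dryRun": %s}\n' "$dry_run"
  fi
}
```

Running a dry run first and reviewing its result before repeating the call with dryRun set to false limits the risk of unwanted changes in Purview.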

DELETE Operation

The DELETE operation deletes all assets created by the Purview Connector. It accepts the properties listed below in JSON format in the request body.

Table 3. DELETE operation parameters
Name Description

dryRun

Set this property to true to perform a dry run without deleting anything in Purview. This parameter defaults to the corresponding property set in the function application configuration.

deletePhysicalAssets

Set this property to true to delete the physical assets (table, columns) related to Semarchy logical assets. This parameter defaults to false.

Setting this property to true deletes physical assets created by the built-in Purview scanners.
Example 5. DELETE operation sample request body
{
    "dryRun": true,
    "deletePhysicalAssets": false
}
A DELETE operation cannot be undone.