Integrate with Purview

This page explains how to integrate Semarchy xDM with Microsoft Purview.

Purview is a data governance service and data catalog on Microsoft Azure, designed to store and manage both physical and logical metadata assets.

With xDM’s Purview connector, metadata from Semarchy data hubs can be synchronized with Purview, linking logical model assets (i.e., entities and attributes) and their corresponding physical assets (i.e., tables and columns), and thereby enabling end-to-end data lineage.

Overview

The Purview connector converts Semarchy metadata into Purview assets as follows:

  • xDM instances, data locations, entities, attributes, and relationships are converted into Purview entities using Semarchy-specific asset types.

  • Entities and attributes are related to the corresponding physical tables (GD_, MD_, etc.) and columns, which must already have been scanned by Purview’s built-in scanners for Microsoft SQL Server, PostgreSQL, or Oracle. These tables and columns are enriched with information from Semarchy.

  • A process is created to associate the physical tables and depict the certification process for each entity.

Configure Purview

Configure a collection

The Purview connector creates xDM assets in a collection. We recommend creating a dedicated collection for these assets and noting its name, which you need when configuring the connector.

Configure the REST API

The Purview connector leverages the REST API to create and update assets in Purview.

To enable this connectivity, follow the instructions in the Purview documentation to configure and use the REST API.

While configuring the REST API, make sure to collect the following details:

  • The tenant ID: sign in to the Azure portal, browse to Microsoft Entra ID > Properties, and scroll down to the Tenant ID section.

  • The name of the Purview account (e.g., SemarchyDemoPurview).

  • The Application (client) ID and client secret of the application registered in Microsoft Entra ID and assigned to a data plane role (Data Curator) for the Purview account.
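Before deploying the connector, it can be useful to check that the collected tenant ID, client ID, and client secret are accepted by Microsoft Entra ID. The following is a minimal sketch that only builds the token request. It assumes the standard Microsoft identity platform v2.0 token endpoint and the https://purview.azure.net/.default scope; both are assumptions to confirm against the Purview REST API documentation, and the helper name build_token_request is hypothetical. It is not a description of how the connector itself authenticates.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumptions: v2.0 token endpoint and Purview data-plane scope.
# Confirm both against the Purview REST API documentation.
TOKEN_URL = "https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
PURVIEW_SCOPE = "https://purview.azure.net/.default"


def build_token_request(tenant_id: str, client_id: str, client_secret: str) -> Request:
    """Build (but do not send) an OAuth 2.0 client-credentials token request."""
    form = urlencode(
        {
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
            "scope": PURVIEW_SCOPE,
        }
    ).encode("ascii")
    return Request(
        TOKEN_URL.format(tenant_id=tenant_id),
        data=form,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen sends the request; a successful response is a JSON document that contains an access_token field.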

Configure Semarchy xDM

The Purview connector retrieves metadata from Semarchy xDM through a REST API, and authenticates using an API key.

In Semarchy xDM, create an API key with the Application Management and Repository Information privileges.

Make sure that you have the following information to connect your Semarchy instance:

  • The API key.

  • The xDM instance’s URL, typically http://<host>:<port>/semarchy. From xDM’s Welcome page, navigate to Configuration > Global Configuration to find the base URL.

Scan the data location schema

During execution, the Purview connector searches for the physical assets (tables, columns) within the data locations cataloged in Purview to correlate them with the logical assets.

Prior to initiating the Purview connector, you must scan the xDM data location schemas using Purview’s built-in scanners and harvest metadata on the physical assets (table and column definitions).

To scan a data location schema:

  1. In Purview, register a new data source pointing to the data location schema.

  2. Scan this data source.
    After the scan, the tables and columns are visible in Purview.

  3. Search for a GD_ table corresponding to your data location to confirm that the scan was successful.

  4. Navigate to the schema (the container) hosting the tables, and note the Qualified Name of that schema. For example:
    postgresql://servers/176.159.263.21:15432/dbs/postgres/schemas/semarchy_product_retail_mdm

Repeat the previous steps for each data location deployed for the Semarchy xDM instance.
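Because the qualified name is later copied by hand into the connector configuration, a quick structural check can catch copy-and-paste mistakes. The sketch below parses a schema qualified name into its parts. The pattern is inferred from the single PostgreSQL example above and is an assumption: qualified names for other technologies may follow a different layout. The names SchemaRef and parse_schema_qualified_name are hypothetical.

```python
import re
from typing import NamedTuple

# Assumption: layout inferred from the PostgreSQL example in this page:
# <technology>://servers/<host:port>/dbs/<database>/schemas/<schema>
_QUALIFIED_NAME = re.compile(
    r"^(?P<technology>[a-z]+)://servers/(?P<server>[^/]+)"
    r"/dbs/(?P<database>[^/]+)/schemas/(?P<schema>[^/]+)$"
)


class SchemaRef(NamedTuple):
    technology: str
    server: str
    database: str
    schema: str


def parse_schema_qualified_name(qualified_name: str) -> SchemaRef:
    """Split a schema qualified name into its components, or raise ValueError."""
    match = _QUALIFIED_NAME.match(qualified_name.strip())
    if match is None:
        raise ValueError(f"unexpected qualified name layout: {qualified_name!r}")
    return SchemaRef(**match.groupdict())
```

A name that stops at the database level, or that points to a table instead of the schema, is rejected by this check.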

Each data source technology has specific configuration steps, outlined in the official Purview documentation. For example, some sources require storing the database password in Azure Key Vault.
For detailed guidance, see the Purview documentation for the technology hosting your data location (Microsoft SQL Server, PostgreSQL, or Oracle).

Deploy the Purview connector

The Purview connector retrieves metadata from deployed model editions in an xDM instance, creates logical assets in the Purview collection that you created, and links them with the previously scanned physical assets.

The Purview connector is packaged as a Docker image, which can be deployed and executed as an Azure Functions application.

Prepare the configuration file

The Azure Functions app is configured during deployment using a JSON configuration file.

To prepare the configuration file:

  1. Download the sample Azure Functions app configuration file (create-azure-function-app-settings.json).

  2. Edit this file and set the configuration properties as indicated below.

Table 1. Configuration properties
Property Value

xdmPurviewAccount

The Purview account name you retrieved when configuring the Purview REST API (e.g., SemarchyDemoPurview).

xdmPurviewTenantId

The tenant ID you retrieved when configuring the Purview REST API.

xdmPurviewClientId

The application (client) ID created when configuring the Purview REST API.

xdmPurviewClientSecret

The client secret created when configuring the Purview REST API.

xdmDataLocationPurviewQualifiedName_<data-location-name>

The qualified name of the Purview entity representing the container (database schema) hosting the tables of the data location named <data-location-name>. You retrieved this qualified name after scanning the data location schema.

Create one property for each Semarchy xDM data location to synchronize.

Example

To synchronize two data locations named CustomerB2CDemo and ProductRetailDemo, set the qualified names for their schemas as follows:

{
  ...
  "xdmDataLocationPurviewQualifiedName_CustomerB2CDemo": "postgresql://servers/176.159.263.21:15432/dbs/postgres/schemas/semarchy_customer_b2c_mdm",
  "xdmDataLocationPurviewQualifiedName_ProductRetailDemo": "postgresql://servers/176.159.263.21:15432/dbs/postgres/schemas/semarchy_product_retail_mdm",
  ...
}

xdmPurviewCollectionName

The name of the collection you configured for the xDM assets (e.g., xDM assets).

xdmPurviewSkipTypes

The Purview connector creates Semarchy-specific asset types before creating the Semarchy assets. Set this property to true to skip the asset type creation, for example, if the types already exist in Purview and do not require an update.

xdmPurviewSkipPhysicalAssetsUpdate

The Purview connector updates the tables and columns with descriptions based on Semarchy metadata. Set this property to true to skip this update.

xdmPurviewDryRun

Set this property to true to perform a dry run, without creating or updating anything in Purview.

xdmPurviewConnectorSchedule

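Cron schedule for the Purview connector execution, written as a six-field expression that starts with a seconds field (e.g., 0 * * * * *, which triggers at second 0 of every minute).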

xdmInstanceUrl

The Semarchy instance URL, typically http://<host>:<port>/semarchy.

xdmInstanceApiKey

The API key to connect the Semarchy instance.

scheduledXdmPurviewConnectorDisabled

Set to true to disable the scheduled execution.

Example 1. Sample create-azure-function-app-settings.json configuration file
{
  "xdmPurviewAccount": "SemarchyDemoPurview",
  "xdmPurviewTenantId": "758077ec-66b9-441c-9537-b0939cb3dfe8",
  "xdmPurviewClientId": "xxxxxxxxx",
  "xdmPurviewClientSecret": "xxxxxxxxx",
  "xdmDataLocationPurviewQualifiedName_CustomerB2CDemo": "postgresql://servers/176.159.263.21:15432/dbs/postgres/schemas/semarchy_customer_b2c_mdm",
  "xdmDataLocationPurviewQualifiedName_ProductRetailDemo": "postgresql://servers/176.159.263.21:15432/dbs/postgres/schemas/semarchy_product_retail_mdm",
  "xdmPurviewCollectionName": "xDM Assets",
  "xdmPurviewSkipTypes": "true",
  "xdmPurviewSkipPhysicalAssetsUpdate": "true",
  "xdmPurviewDryRun": "true",
  "xdmPurviewConnectorSchedule": "0 * * * * *",
  "xdmInstanceUrl": "http://176.159.263.21:10081/semarchy",
  "xdmInstanceApiKey": "xxxxxxxxx",
  "scheduledXdmPurviewConnectorDisabled": "true",
  "FUNCTIONS_WORKER_RUNTIME": "java"
}
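A settings file with a missing key or a mistyped value is easier to fix before deployment than after. The following sketch checks a settings dictionary against the properties described in Table 1. Which keys are mandatory is an assumption made for this check only, as is the six-field rule for the schedule (taken from the sample value); the function names are hypothetical and the connector may apply different rules.

```python
import json
from typing import Dict, List

# Assumption: these keys are treated as mandatory by this check only.
REQUIRED_KEYS = (
    "xdmPurviewAccount",
    "xdmPurviewTenantId",
    "xdmPurviewClientId",
    "xdmPurviewClientSecret",
    "xdmPurviewCollectionName",
    "xdmInstanceUrl",
    "xdmInstanceApiKey",
)
# The sample file stores booleans as the strings "true" and "false".
BOOLEAN_KEYS = (
    "xdmPurviewSkipTypes",
    "xdmPurviewSkipPhysicalAssetsUpdate",
    "xdmPurviewDryRun",
    "scheduledXdmPurviewConnectorDisabled",
)
DATA_LOCATION_PREFIX = "xdmDataLocationPurviewQualifiedName_"


def data_location_names(settings: Dict[str, str]) -> List[str]:
    """Return the data location names declared through qualified-name properties."""
    return [
        key[len(DATA_LOCATION_PREFIX):]
        for key in settings
        if key.startswith(DATA_LOCATION_PREFIX)
    ]


def check_settings(settings: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    for key in REQUIRED_KEYS:
        if not settings.get(key):
            problems.append(f"missing or empty: {key}")
    for key in BOOLEAN_KEYS:
        if key in settings and settings[key] not in ("true", "false"):
            problems.append(f"expected 'true' or 'false': {key}")
    if not data_location_names(settings):
        problems.append(f"no {DATA_LOCATION_PREFIX}<data-location-name> property")
    schedule = settings.get("xdmPurviewConnectorSchedule")
    if schedule is not None and len(schedule.split()) != 6:
        problems.append("xdmPurviewConnectorSchedule should have six fields")
    return problems


def check_settings_file(path: str) -> List[str]:
    """Load a JSON settings file and run the checks on it."""
    with open(path, encoding="utf-8") as handle:
        return check_settings(json.load(handle))
```

For example, check_settings_file("create-azure-function-app-settings.json") returns an empty list for the sample file above.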

Deploy the Azure Function

  1. Download the Azure function creation script (create-azure-function-app.sh).

  2. Review and amend the script depending on your Azure and local environments.

  3. Run the script to create all the resources required to run the connector.

This script creates or updates the following resources:

  • A resource group.

  • A storage account within the selected resource group.

  • A Premium plan for the function app (see below).

  • An Azure Functions app that contains the Azure Functions used to launch the operations available on the connector.

The script reads the create-azure-function-app-settings.json file located in the same folder and uses it to configure the deployed Azure function.

Example 2. Azure function creation syntax
./create-azure-function-app.sh \
  --name <azure-function-name> \
  --storage-account <storage-account-name> \
  --resource-group <resource-group-name> \
  --docker-image-tag <docker-image-tag>

Main script parameters:

  • --name (required): name of the Azure Functions app to deploy. This name must be unique, as it is used for the endpoint URL (https://<name>.azurewebsites.net/api/).

  • --storage-account (required): name of the Azure storage account to create or update. The name must be unique and contain 3 to 24 characters, using numbers and lowercase letters only.

  • --resource-group (required): name of the Azure resource group into which the function is created. If it does not exist, the script creates the group, provided that the account has sufficient privileges.

  • --docker-image-tag: Docker image tag to use for the connector (defaults to latest). The tag must correspond to the version of the Semarchy instance from which the connector will harvest metadata.

To see all available parameters, run ./create-azure-function-app.sh -h.
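The naming rule for --storage-account can be checked locally before running the script. This is a minimal sketch of the rule as stated above (3 to 24 characters, numbers and lowercase letters only); the function name is hypothetical, and uniqueness of the name can only be verified by Azure.

```python
import re

# Rule as stated in the parameter description: 3 to 24 characters,
# numbers and lowercase letters only.
_STORAGE_ACCOUNT_NAME = re.compile(r"[a-z0-9]{3,24}")


def is_valid_storage_account_name(name: str) -> bool:
    """Return True when the name satisfies the documented format rule."""
    return _STORAGE_ACCOUNT_NAME.fullmatch(name) is not None
```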

Azure Functions app creation sample
./create-azure-function-app.sh \
  --name xdm-purview-connector \
  --storage-account xdmstorageaccount \
  --resource-group xdm-group \
  --docker-image-tag 2023.2.0

For macOS users

The script uses the GNU getopt command, which is not available by default on macOS.
To install this tool:

  1. Install gnu-getopt using Homebrew:

    brew install gnu-getopt

  2. Add getopt to your path using the command corresponding to your shell:

    # The path below assumes Homebrew's /usr/local prefix.
    # On Apple Silicon, the default prefix is /opt/homebrew.
    # bash
    echo 'export PATH="/usr/local/opt/gnu-getopt/bin:$PATH"' >> ~/.bash_profile
    # zsh
    echo 'export PATH="/usr/local/opt/gnu-getopt/bin:$PATH"' >> ~/.zshrc

Run the Purview connector manually

Once configured with a Cron schedule, the Azure Function runs the Purview connector automatically.

You can also run operations using the REST endpoints exposed by the function.

To use the Azure Function endpoints:

  1. Retrieve the endpoint URL from the Azure Function.

    Instead of passing the function key in the code query parameter, you can send it in a request header named x-functions-key.

  2. Use a REST client to perform the request.

  3. Review the Semarchy assets created in the collection.

To avoid conflicts between your manual operations and the scheduled executions of the Purview connector, change the Azure Functions configuration and set the scheduledXdmPurviewConnectorDisabled property to true before running the Purview connector manually.

POST operation

The POST operation runs the Purview connector manually. It accepts the properties listed below in JSON format within the request body.

Table 2. POST operation parameters
Name Description

dryRun

Set this property to true to perform a dry run, without creating or updating anything in Purview. This parameter defaults to the corresponding property set in the function app configuration.

skipTypes

Set this property to true to skip asset-type creation or update. This parameter defaults to the corresponding property set in the function app configuration.

skipPhysicalAssetsUpdate

The Purview connector updates the tables and columns with descriptions based on Semarchy metadata. Set this property to true to skip this update. This parameter defaults to the corresponding property set in the function app configuration.

dataLocationNames

Array of data location names to synchronize. Use this option to synchronize only a subset of the data locations. This list defaults to all Semarchy data locations that are in the Ready status and have a qualified name defined in the application configuration.

A data location not found in Semarchy or with no qualified name correspondence in the application settings will be ignored.

purviewCollectionName

The name of the collection you configured to receive the Semarchy assets. This parameter defaults to the corresponding property set in the function application configuration.
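The selection rules described for dataLocationNames can be summarized in a few lines. The sketch below mirrors the documented behavior only; it is not the connector’s code. The location_status mapping (data location name to status) is a hypothetical representation of what Semarchy reports, and whether an explicitly requested data location must also be in the Ready status is not stated above, so this sketch does not enforce it.

```python
from typing import Dict, Iterable, List, Optional

DATA_LOCATION_PREFIX = "xdmDataLocationPurviewQualifiedName_"


def select_data_locations(
    requested: Optional[Iterable[str]],
    location_status: Dict[str, str],
    settings: Dict[str, str],
) -> List[str]:
    """Return the data locations that a run would synchronize.

    location_status is a hypothetical mapping of data location name to the
    status reported by Semarchy (for example "Ready").
    """
    configured = {
        key[len(DATA_LOCATION_PREFIX):]
        for key in settings
        if key.startswith(DATA_LOCATION_PREFIX)
    }
    if requested is None:
        # Default: every Ready data location with a configured qualified name.
        return sorted(
            name
            for name, status in location_status.items()
            if status == "Ready" and name in configured
        )
    # Explicit list: names unknown to Semarchy or without a qualified name
    # are ignored rather than reported as errors.
    return [
        name for name in requested
        if name in location_status and name in configured
    ]
```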

Example 3. POST operation sample request body
{
    "dryRun": true,
    "skipTypes": false,
    "skipPhysicalAssetsUpdate": false,
    "dataLocationNames": ["CustomerB2CDemo"],
    "purviewCollectionName": "xDM Assets"
}
The process duration may vary, typically ranging from 5 to 10 minutes, depending on chosen options and the size or number of data locations.
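As an illustration of the POST operation, the following sketch builds the request with the function key sent in the x-functions-key header. It only constructs the request object and does not send it. The endpoint URL is whatever you retrieved from the Azure Function; the helper name build_run_request is hypothetical, and rejecting unknown parameter names is a choice made for this sketch, not documented connector behavior.

```python
import json
from urllib.request import Request

# Parameter names accepted by the POST operation, as listed in Table 2.
ALLOWED_PARAMETERS = {
    "dryRun",
    "skipTypes",
    "skipPhysicalAssetsUpdate",
    "dataLocationNames",
    "purviewCollectionName",
}


def build_run_request(endpoint_url: str, function_key: str, **options) -> Request:
    """Build (but do not send) a POST request that runs the connector."""
    unknown = set(options) - ALLOWED_PARAMETERS
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    return Request(
        endpoint_url,
        data=json.dumps(options).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "x-functions-key": function_key,
        },
        method="POST",
    )
```

Since a run can take several minutes, pass a generous timeout when sending the request, for example urllib.request.urlopen(request, timeout=900).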

DELETE operation

The DELETE operation deletes all assets created by the Purview connector. It accepts the properties listed below in JSON format within the request body.

Table 3. DELETE operation parameters
Name Description

dryRun

Executes a dry run if set to true, without deleting anything in Purview. Defaults to the corresponding property set in the function app configuration.

deletePhysicalAssets

If set to true, deletes the physical assets (i.e., tables and columns) associated with Semarchy logical assets. Defaults to false.

Setting this property to true also deletes the physical assets created by the built-in Purview scanners.

Example 4. DELETE operation sample request body
{
    "dryRun": true,
    "deletePhysicalAssets": false
}
DELETE operations cannot be undone.
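Because deletions are permanent, a client can default to a dry run and require an explicit confirmation before asking for physical assets to be deleted. The sketch below shows one such guard. The helper name, its confirm_physical_delete flag, and the guard itself are assumptions of this sketch and are not features of the connector; the request is built but not sent.

```python
import json
from urllib.request import Request


def build_delete_request(
    endpoint_url: str,
    function_key: str,
    dry_run: bool = True,
    delete_physical_assets: bool = False,
    confirm_physical_delete: bool = False,
) -> Request:
    """Build (but do not send) a DELETE request, defaulting to a dry run."""
    if delete_physical_assets and not dry_run and not confirm_physical_delete:
        # Client-side guard only: scanned tables and columns would be removed.
        raise ValueError("set confirm_physical_delete=True to delete physical assets")
    body = {"dryRun": dry_run, "deletePhysicalAssets": delete_physical_assets}
    return Request(
        endpoint_url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "x-functions-key": function_key,
        },
        method="DELETE",
    )
```

Running the dry run first and reviewing its outcome before sending the same request with dry_run=False keeps the irreversible step deliberate.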