Table of Contents

Welcome to Semarchy xDM Discovery.
This guide explains how to use xDM Discovery to perform source data profiling and analyze data when starting your data management initiative.

Preface

Overview

Using this guide, you will learn:

  • The architecture of xDM Discovery and its various components.
  • To configure and profile datasources in xDM Discovery.
  • To explore profiles using the built-in profiling dashboards.
  • To customize these dashboards for your use cases.

Audience

This document is intended for Data Architects and Business Users interested in using Semarchy xDM Discovery to build insight on their data.

To discover Semarchy xDM, you can watch our tutorials.
The Semarchy xDM Documentation Library, including the development, administration and installation guides is available online.

Document Conventions

This document uses the following formatting conventions:

ConventionMeaning

boldface

Boldface type indicates graphical user interface elements associated with an action, or a product specific term or concept.

italic

Italic type indicates special emphasis or placeholder variable that you need to provide.

monospace

Monospace type indicates code example, text or commands that you enter.

Other Semarchy Resources

In addition to the product manuals, Semarchy provides other resources available on its web site: http://www.semarchy.com.

Obtaining Help

There are many ways to access the Semarchy Technical Support. You can call or email our global Technical Support Center (support@semarchy.com). For more information, see http://www.semarchy.com.

Feedback

We welcome your comments and suggestions on the quality and usefulness of this documentation.
If you find any error or have any suggestion for improvement, please mail support@semarchy.com and indicate the title of the documentation along with the chapter, section, and page number, if available. Please let us know if you want a reply.

Introduction to Semarchy xDM Discovery

What is xDM Discovery?

Semarchy xDM Discovery enables data architects and business users to gather metrics and profile any source data to prepare a data management initiative.

xDM Discovery connects to datasources containing tables, profile the data in the tables and persists these profiles in the Semarchy xDM Repository. Users can analyze these profiles using built-in dashboards available from xDM Discovery.

For advanced profiling, data architect can seed, from the built-in dashboards, fully customizable Semarchy xDM Dashboards applications.

Profiles Metrics

xDM Discovery scans data tables and gathers in the profiles the following metrics:

  • Table metrics:
    • Number of records
  • Column metrics:
    • Lowest/highest values
    • Most frequent value
    • Uniqueness
    • Null count
    • Minimum, maximum and Average value length
    • Distinct value number and distribution
    • Pattern distribution and lowest/highest values

Using these metrics, data architects have a clear assessment of the data quality. From this assessment, they can discuss and infer the structure and data rules to apply to the data hub entities. xDM Discovery helps in the design phase of the data hub, before and while implementing the model in Semarchy xDM.

Architecture Overview

xDM Discovery uses the following components:

  • The Datasources are data stores containing the data to profile. A datasource may be directly connected to the application data to profile, or a staging location into which application data is loaded using an integration layer.
  • The Semarchy xDM repository stores the datasources' definitions and configuration as well as their profiling metrics.
  • The xDM Discovery Engine runs in the Semarchy xDM platform. It is a multi-threaded engine that runs profiling processes to gather datasources' metrics - the profiles - into the repository. This engine runs the profiling processes on demand or according to schedules.
  • The xDM Discovery User Interface allows data architects and business users to define and configure datasources, manage profiling processes, and browse profiles.

The xDM Discovery User Interface comes with built-in charts and dashboards to browse the profiles. These charts and dashboards can be forked into Semarchy xDM Dashboards for further customization. Alternately, you can connect a third-party visualization platform to create your own dashboards and visualizations on top of the profiles.

Setting Up xDM Discovery

xDM Discovery is installed by default with Semarchy xDM and does not require any specific step after the Semarchy xDM repository installation.

Configuring the Application Server

xDM Discovery connects to the data to profile via datasources configured in the application server running the Semarchy xDM application. Make sure to configure these application server datasources pointing to the data to profile before defining them in xDM Discovery.

Refer to the Semarchy xDM Installation Guide for detailed instructions about how to configure datasources for your application server.

Configuring Users and Roles

To access the xDM Discovery user interface, a user must have a role with the Discovery Management platform-level privilege.

Make sure to grant the Discovery Management privilege to the roles allowed to configure datasources and manage profiling processes.

Refer to the Managing the Security section in the Semarchy xDM Administration Guide for more information about application server datasource configuration.
A user without the appropriate privileges will not see xDM Discovery on the welcome page when logged in.

Working with xDM Discovery

xDM Discovery provides an interface to declare and profile datasources. It is available from the Semarchy xDM welcome page. Click the xDM Discovery icon to open this application in a web browser.
image

xDM Discovery appears as shown below:

image

The xDM Discovery interface includes:

  • The Navigation Drawer at the left-hand side. The navigation drawer shows the datasource shortcuts, and a Datasources item to open the list of all datasources.
  • The Application Header which displays:
    • The image Menu icon to hide/show the navigation drawer.
    • A breadcrumb to the datasource displayed or to the list of datasources.
    • The image user menu.
  • The main editor, showing the datasource being managed or a list of datasources
  • An optional secondary side nav at the right-hand side shows the properties of an object selected within the main editor.

Fields in xDM Discovery are documented within their editor. Move your cursor over the field and click the information icon to display the documentation for each field, as shown below.

image

Advanced Features

YAML Editing

Datasources definition and configuration in xDM Discovery are stored in a YAML form: a text representation of the object. You can access for all editors this YAML representation by clicking the image Switch to the YAML editor button in the editor toolbar.

Use YAML editor to quickly modify objects and perform find-replace type of operations.

Issue Reporting

xDM Discovery provides a built-in validation feature available as a Validate button in each editor’s toolbar to validate the YAML.

This validation displays an issue report describing the various issues found in the datasource definition.

Managing Datasources

When you connect to xDM Discovery, the datasources list appears:

  • When this list is empty, a Create Datasource button appears to create your first datasource.
  • When this list already contains datasources, you can click any datasource in the list to browse its profile, or click the image Edit action to configure this datasource.
  • You can switch the list to a table view by selecting the image Table view item in the image More menu. In the table view:
    • Click the column header to sort the datasources or use the image Sort menu item.
    • Customize the table contents with the image Select columns menu item.

Datasources in the datasource list support the following operations:

  • image View Dashboard to view the datasource profile.
  • image Edit to configure the datasource.
  • image Copy a selected datasource. When copying a datasource, you are prompted for the name of the new datasource.
  • image Rename a selected datasource. This operation changes the internal name of the datasource and not its label. When renaming a datasource, the incoming references to this datasource are not automatically updated.
  • image Delete selected datasources. This operation cannot be undone.
  • image Seed profiling app from a selection of datasource. See Customizing Applications in xDM Dashboards for more information.
Renaming or deleting a datasource changes or invalidates the URL used to browse this datasource. These two operations must be performed with care if you have shared this URL with other users.
Some of these operations are available from the navigation drawer, in the image More menu that is available on each datasource shortcut.

Creating a Datasource

To create a datasource:

  1. Select Datasources in the navigation drawer.
    The Datasources list opens.
  2. Click on the Add datasource floating action button in the lower-right corner of the screen.
  3. In the New Datasource dialog:
    1. Select the JNDI datasource name corresponding to a datasource configured in the application server.
      The Database type is automatically detected, and the Name and Label are automatically generated based on the JNDI datasource name.
    2. Change the Name and Label for the datasource.
    3. Click Create.

The datasource editor opens with the list of tables found in the database.

Configuring a Datasource

To configure a datasource:

  1. In the datasource editor, select the image Open datasource property item in the editor toolbar.
    The datasource properties sidesheet opens.
  2. Change the datasource label:
    • Click the image Rename button in the sidesheet header and enter a new label for the datasource.
  3. Configure the datasource properties:
    • In the Profiling Options:
      • Max processes define the maximum number of tables that can be simultaneously profiled.
    • In the Display Properties:
      • Select Show in shortcuts to have the datasource appear in the xDM Discovery navigation drawer.
      • Select an Icon and Icon color. This optional icon appears in the shortcuts aside the datasource label.
      • Select a Color for the datasource label.
    • In the Documentation section, edit the rich text describing the datasource.
  4. Click image Save to save your changes.

Managing Tables

The datasource contains the list of tables retrieved from the database at a certain point in time. If you have just created the datasource, or you have added new tables in the database, you can refresh the list of tables.

To refresh the list of tables:

  1. In the datasource editor, select the image More menu in the upper-right corner of the list of tables.
  2. Select image Refresh table list.

The list of tables refreshes:

  • Existing tables remain untouched and keep their profile.
  • New tables are added to the list with no profile. You may start profiling these tables
  • Tables that were removed from the database now appear with a not found indicator. If these tables had profiles, they are preserved until you choose to remove these tables.

The list of tables supports the following capabilities:

  • Sorting by clicking the column header.
  • Multiple selections and select all.
  • Filtering by table name using the image Filter button.

To remove tables from the list:

  1. In the datasource editor, select the tables that you want to remove.
  2. In the image More menu, select image Remove table.
  3. When you remove tables existing profiles, you must confirm the deletion of these profiles.
If you have removed by mistake a table, refresh the table list to restore it in the list.
Dropping a table profile deletes all profiling information for this table, which cannot be undone.

To configure table properties:

  1. In the datasource editor, click the table that you want to configure.
    The table properties sidesheet opens.
  2. Configure the datasource properties:
    • Select Scheduled to profile this table regularly on a schedule, and then define Cron expression defining the profiling schedule.
    • In the Profiling Options:
      • Row sampling defines size of the sample used for value and pattern analysis.
  3. Click image Save to save your changes.
To edit the properties for multiple tables, select these tables and then image Edit menu in the image More menu.
You can enable/disable the Scheduled flag on the tables directly from the list of tables.

Profiling Tables

To profile tables:

  1. In the datasource editor, select the tables that you want to profile.
  2. Select image Profile selection in the table toolbar.
    Profiling starts for the selected tables.

When a table is the Profiling or Queued status, you can stop its profiling processes by selecting it the list and then clicking image Cancel profiling from the image More menu.

The Profiling Process

When profiling runs for a table, its Profile Status is set to Profiling, followed by a progress percentage. At issue, the table state becomes Ready.

The Last profiled column indicates when the last successful profiling process ran for this table, and the View profile link opens the profile for this specific table.

Only a limited number of tables are profiled simultaneously. This limit is set by the Max processes property in the datasource configuration. If you start profiling more tables, some of them appear as Queued. They are moved to the Profiling state when the previous tables have been profiled.

If the profiling process fails for a table, for example, if the table was dropped from the database, then the status is Failed. A tooltip on the failed status icon shows the error message.

The Profile Status reflects the current state of profiling for each table in the datasource. An empty value reflects a table for which there is no profiling data. Other statuses are:

  • Ready: The latest profiling process ran successfully. Fresh profiling data is available for this table.
  • Profiling: A profiling process is running for the table.
  • Queued: A profiling process is queued for the table.
  • Failed: The latest profiling process failed. Previous profiling data may be available.
  • Canceled: The latest profiling process was stopped by the user.

Browsing the Profiles

Using xDM Discovery Built-in Dashboards

xDM Discovery includes built-in dashboards to visualize the profiling data.

Datasource Profile

The Datasource Profile dashboard shows the profiling information at the datasource level. It is available by clicking:

  • a datasource in the navigation drawer’s Shortcuts,
  • a datasource in the Datasources list,
  • the image View profile button in the datasource editor.

This dashboard contains the following tabs:

  • Profile shows the profiled tables status with their record count.
  • Table shows the columns of a given table, including lowest, highest, most frequent values, etc. Use the charts on this dashboard to infer the correct sizing for a column, the value range and mandatory nature the columns within a table.
  • Columns shows the detailed column metrics for a given table, including lowest, highest, most frequent values and other value metrics. It also exposes value and pattern distribution for the column. Use this dashboard to decide whether a column should be constrained to a list of values, to infer the nature (Phone Number, SSID, etc.), as well as the rules and format to apply to the column.
Table Profile

The Table Profile dashboard shows the metrics from the Table and Column tabs, focusing on a single table of the datasource. It appears when clicking the View profile link for a table from the dashboard editor.

Built-in dashboards have an image Edit datasource button in the upper left corner to return to the datasource editor.

Customizing Applications in xDM Dashboards

xDM Discovery lets you seed an application in xDM Dashboards that contains the built-in dashboards. You can customize this application in Dashboard Builder and create visualizations specifically for your use case.

To seed a profiling application:

  1. Select Datasources in the navigation drawer.
    The Datasources list opens.
  2. From the list, select the datasources that you want to include in your dashboard application.
  3. In the image More menu, select image Seed profiling app.
  4. Provide a Title and a Name for this application.
  5. Click Create
    The application is generated and opens in Dashboard Builder.

The generated application contains multiple charts and dashboard, some of which are not used in the application. Feel free to use and customize every component of this application.

Customizing the generated application does not alter the built-in dashboards. Re-generating the application in Dashboard Builder recreates it from the built-in dashboards.
Refer to the xDM Dashboards Designer’s Guide for more information about dashboard applications customization.

Using Your BI Tools with xDM Discovery

You can use your own BI Tool to connect and perform analysis on xDM Discovery profiles:

  • The Semarchy xDM repository, which stores the profile data, is hosted in a relational database accessible via JDBC, ODBC and other connectivity methods.
  • Appendix A contains a reference to help you build dashboards based on the tables storing the profile data.
  • The  built-in dashboards may be used as starting points and examples to build your own visualizations.
Make sure to access these schemas from your BI tools with credentials having SELECT privileges only. Do not attempt to insert or update data in these schemas.

Appendices

Appendix A: xDM Discovery Table Reference

The Semarchy xDM repository stores the profiling metrics in the tables listed in this appendix. Use this reference information to customize the existing charts or build your own visualizations on top of this data.

PRF_PROFILING

This table contains information about the profiling status of the tables.

Column NameDescription

DATA_SOURCE

text

Name of the datasource.

TABLE_NAME

text

Name of the table.

STATUS

text

Profiling status of the table. Refer to the profiling process section for more information.

  • DONE: Ready - The latest profiling process ran successfully.
  • IN PROGRESS: Profiling - A profiling process is running for the table.
  • QUEUED: Queued - A profiling process is queued for the table.
  • ERROR: Failed - The latest profiling process failed.
  • CANCELING: Temporary status while the engine processes a user cancellation.
  • CANCELED: Canceled - The latest profiling process was stopped by the user.
  • NONE: his corresponds to a profile that was never executed.

PROGRESS

number

Progress of a running profiling process (1..100)

ERROR_MESSAGE

text

Error message when the profiling process has failed.

QUEUED_DATE

datetime

Timestamp when the profiling process was queued.

START_DATE

datetime

Timestamp when the profiling process was started.

END_DATE

datetime

Timestamp when the profiling process was finished.

PRF_TABLE

This table contains profiling metrics for the tables.

Column NameDescription

DATA_SOURCE

text

Name of the datasource.

TABLE_NAME

text

Name of the table.

NUM_ROWS

number

Number of rows in the tables

UPDATE_DATE

datetime

Timestamp when the profile was updated.

PRF_COLUMN

This table contains profiling metrics for the columns.

These metrics are reported for the entire table. They are not limited by the table’s Row Sampling profiling option.
Column NameDescription

DATA_SOURCE

text

Name of the datasource.

TABLE_NAME

text

Name of the table.

COLUMN_NAME

text

Name of the profiled column.

ORDINAL_POSITION

number

Ordinal position of the column in the table.

DATA_TYPE

text

Datatype of the column.

NUM_ROWS

number

Number of rows in the table.

LOW_VAL

text

Lowest value found in the column for all records.

HIGH_VAL

text

Highest value found in the column for all records.

NUM_NULLS

number

Number of null value occurrences found for the column for all records.

NUM_DIST_VALS

number

Number of distinct values found for the column for all records.

MOST_FREQ_VAL

text

Most frequent value found for the column for all records.

AVG_VAL

text

Average value found for the column for all records.

PERCENTILE

text

Not loaded yet. Reserved for future use.

MIN_LENGTH

number

Minimum value length found for the column for all records.

MAX_LENGTH

number

Maximum value length found for the column for all records.

AVG_LENGTH

number

Average value length found for the column for all records.

Note that not all metrics are available for all datatypes. The following table lists the metrics available depending on the column datatype.

MetricApplicable datatype

NUM_ROWS

All

NUM_NULLS

All

LOW_VAL, HIGH_VAL

text, numeric, date & datetime

AVG_VAL

numeric, date & datetime

NUM_DIST_VALS

text, numeric, date & datetime, boolean

MOST_FREQ_VAL

text, numeric, date & datetime, boolean

MIN_LENGTH, MAX_LENGTH, AVG_LENGTH

text, longtext, binary

PERCENTILE

Not available yet.

Value distribution

text, numeric, date & datetime

Pattern distribution

text

PRF_DIST_VALS

This table contains the list of distinct values in the profiled columns.

Only a subset of the column values are reported. The distinct values reported for each column are limited by the table’s Row Sampling profiling option.
Column NameDescription

DATA_SOURCE

text

Name of the datasource.

TABLE_NAME

text

Name of the table.

COLUMN_NAME

text

Name of the profiled column.

DIST_VAL

text

Column value.

NUM_ROWS

number

Number of occurrences of the value found in the table.

PRF_DIST_PATTERNS

This table contains the list of distinct patterns found in the values of the profiled columns.

Only a subset of the column value patterns are reported. The patterns reported for each column are limited by the table’s Row Sampling profiling option.
Column NameDescription

DATA_SOURCE

text

Name of the datasource.

TABLE_NAME

text

Name of the table.

COLUMN_NAME

text

Name of the profiled column.

DIST_PATTERN

text

Pattern found in the column values.

NUM_ROWS

number

Number of occurrences of the pattern in the column.

HIGH_VAL

text

Highest value matching this pattern in the column.

LOW_VAL

text

Lowest value matching this pattern in the column.