Welcome to Semarchy xDM Discovery.
This guide explains how to use xDM Discovery to perform source data profiling and analyze data when starting your data management initiative.
Preface
Overview
Using this guide, you will learn:
- The architecture of xDM Discovery and its various components.
- To configure and profile datasources in xDM Discovery.
- To explore profiles using the built-in profiling dashboards.
- To customize these dashboards for your use cases.
Audience
Document Conventions
This document uses the following formatting conventions:
Convention | Meaning |
---|---|
boldface | Boldface type indicates graphical user interface elements associated with an action, or a product specific term or concept. |
italic | Italic type indicates special emphasis or placeholder variable that you need to provide. |
| Monospace type indicates code example, text or commands that you enter. |
Other Semarchy Resources
In addition to the product manuals, Semarchy provides other resources available on its web site: https://www.semarchy.com.
Obtaining Help
There are many ways to access the Semarchy Technical Support. You can call or email our global Technical Support Center (support@semarchy.com). For more information, see https://www.semarchy.com.
Feedback
We welcome your comments and suggestions on the quality and usefulness
of this documentation.
If you find any error or have any suggestion for improvement, please
mail support@semarchy.com and indicate the title of the documentation
along with the chapter, section, and page number, if available. Please
let us know if you want a reply.
Introduction to Semarchy xDM Discovery
What is xDM Discovery?
Semarchy xDM Discovery enables data architects and business users to gather metrics and profile any source data to prepare a data management initiative.
xDM Discovery connects to datasources containing tables, profile the data in the tables and persists these profiles in the Semarchy xDM Repository. Users can analyze these profiles using built-in dashboards available from xDM Discovery.
For advanced profiling, data architect can seed, from the built-in dashboards, fully customizable Semarchy xDM Dashboards applications.
Profiles Metrics
xDM Discovery scans data tables and gathers in the profiles the following metrics:
- Table metrics:
- Number of records
- Column metrics:
- Lowest/highest values
- Most frequent value
- Uniqueness
- Null count
- Minimum, maximum and Average value length
- Distinct value number and distribution
- Pattern distribution and lowest/highest values
Using these metrics, data architects have a clear assessment of the data quality. From this assessment, they can discuss and infer the structure and data rules to apply to the data hub entities. xDM Discovery helps in the design phase of the data hub, before and while implementing the model in Semarchy xDM.
Architecture Overview
xDM Discovery uses the following components:
- The Datasources are data stores containing the data to profile. A datasource may be directly connected to the application data to profile, or a staging location into which application data is loaded using an integration layer.
- The Semarchy xDM repository stores the datasources' definitions and configuration as well as their profiling metrics.
- The xDM Discovery Engine runs in the Semarchy xDM platform. It is a multi-threaded engine that runs profiling processes to gather datasources' metrics - the profiles - into the repository. This engine runs the profiling processes on demand or according to schedules.
- The xDM Discovery User Interface allows data architects and business users to define and configure datasources, manage profiling processes, and browse profiles.
The xDM Discovery User Interface comes with built-in charts and dashboards to browse the profiles. These charts and dashboards can be forked into Semarchy xDM Dashboards for further customization. Alternately, you can connect a third-party visualization platform to create your own dashboards and visualizations on top of the profiles.
Setting Up xDM Discovery
xDM Discovery is installed by default with Semarchy xDM and does not require any specific step after the Semarchy xDM repository installation.
Configuring the Application Server
xDM Discovery connects to the data to profile via datasources configured in the application server running the Semarchy xDM application. Make sure to configure these application server datasources pointing to the data to profile before defining them in xDM Discovery.
Configuring Users and Roles
To access the xDM Discovery user interface, a user must have a role with the Discovery Management platform-level privilege.
Make sure to grant the Discovery Management privilege to the roles allowed to configure datasources and manage profiling processes.
Working with xDM Discovery
xDM Discovery provides an interface to declare and profile datasources. It is available from the Semarchy xDM welcome page. Click the xDM Discovery icon to open this application in a web browser.
xDM Discovery appears as shown below:
The xDM Discovery interface includes:
- The Navigation Drawer at the left-hand side. The navigation drawer shows the datasource shortcuts, and a Datasources item to open the list of all datasources.
- The Application Header which displays:
- The Menu icon to hide/show the navigation drawer.
- A breadcrumb to the datasource displayed or to the list of datasources.
- The user menu.
- The main editor, showing the datasource being managed or a list of datasources
- An optional secondary side nav at the right-hand side shows the properties of an object selected within the main editor.
Fields in xDM Discovery are documented within their editor. Move your cursor over the field and click the information icon to display the documentation for each field, as shown below.
Advanced Features
YAML Editing
Datasources definition and configuration in xDM Discovery are stored in a YAML form: a text representation of the object. You can access for all editors this YAML representation by clicking the Switch to the YAML editor button in the editor toolbar.
Use YAML editor to quickly modify objects and perform find-replace type of operations.
Issue Reporting
xDM Discovery provides a built-in validation feature available as a Validate button in each editor’s toolbar to validate the YAML.
This validation displays an issue report describing the various issues found in the datasource definition.
Managing Datasources
When you connect to xDM Discovery, the datasources list appears:
- When this list is empty, a Create Datasource button appears to create your first datasource.
- When this list already contains datasources, you can click any datasource in the list to browse its profile, or click the Edit action to configure this datasource.
- You can switch the list to a table view by selecting the Table view item in the More menu. In the table view:
- Click the column header to sort the datasources or use the Sort menu item.
- Customize the table contents with the Select columns menu item.
Datasources in the datasource list support the following operations:
- View Dashboard to view the datasource profile.
- Edit to configure the datasource.
- Copy a selected datasource. When copying a datasource, you are prompted for the name of the new datasource.
- Rename a selected datasource. This operation changes the internal name of the datasource and not its label. When renaming a datasource, the incoming references to this datasource are not automatically updated.
- Delete selected datasources. This operation cannot be undone.
- Seed profiling app from a selection of datasource. See Customizing Applications in xDM Dashboards for more information.
Creating a Datasource
To create a datasource:
- Select Datasources in the navigation drawer.
The Datasources list opens. - Click on the Add datasource floating action button in the lower-right corner of the screen.
- In the New Datasource dialog:
- Select the JNDI datasource name corresponding to a datasource configured in the application server.
The Database type is automatically detected, and the Name and Label are automatically generated based on the JNDI datasource name. - Change the Name and Label for the datasource.
- Click Create.
- Select the JNDI datasource name corresponding to a datasource configured in the application server.
The datasource editor opens with the list of tables found in the database.
Configuring a Datasource
To configure a datasource:
- In the datasource editor, select the Open datasource property item in the editor toolbar.
The datasource properties sidesheet opens. - Change the datasource label:
- Click the Rename button in the sidesheet header and enter a new label for the datasource.
- Configure the datasource properties:
- In the Profiling Options:
- Max processes define the maximum number of tables that can be simultaneously profiled.
- In the Display Properties:
- Select Show in shortcuts to have the datasource appear in the xDM Discovery navigation drawer.
- Select an Icon and Icon color. This optional icon appears in the shortcuts aside the datasource label.
- Select a Color for the datasource label.
- In the Documentation section, edit the rich text describing the datasource.
- In the Profiling Options:
- Click Save to save your changes.
Managing Tables
The datasource contains the list of tables retrieved from the database at a certain point in time. If you have just created the datasource, or you have added new tables in the database, you can refresh the list of tables.
To refresh the list of tables:
- In the datasource editor, select the More menu in the upper-right corner of the list of tables.
- Select Refresh table list.
The list of tables refreshes:
- Existing tables remain untouched and keep their profile.
- New tables are added to the list with no profile. You may start profiling these tables
- Tables that were removed from the database now appear with a not found indicator. If these tables had profiles, they are preserved until you choose to remove these tables.
The list of tables supports the following capabilities:
- Sorting by clicking the column header.
- Multiple selections and select all.
- Filtering by table name using the Filter button.
To remove tables from the list:
- In the datasource editor, select the tables that you want to remove.
- In the More menu, select Remove table.
- When you remove tables existing profiles, you must confirm the deletion of these profiles.
To configure table properties:
- In the datasource editor, click the table that you want to configure.
The table properties sidesheet opens. - Configure the table properties:
- Select Scheduled to profile this table regularly on a schedule, and then define Cron expression defining the profiling schedule.
- In the Profiling Options:
- Maximum distinct values stored defines the maximum number of values stored for distinct value analysis.
- Maximum patterns stored defines the maximum number of patterns stored for pattern analysis.
- Click Save to save your changes.
Profiling Tables
To profile tables:
- In the datasource editor, select the tables that you want to profile.
- Select Profile selection in the table toolbar.
Profiling starts for the selected tables.
When a table is the Profiling or Queued status, you can stop its profiling processes by selecting it the list and then clicking Cancel profiling from the More menu.
The Profiling Process
When profiling runs for a table, its Profile Status is set to Profiling, followed by a progress percentage. At issue, the table state becomes Ready.
The Last profiled column indicates when the last successful profiling process ran for this table, and the View profile link opens the profile for this specific table.
Only a limited number of tables are profiled simultaneously. This limit is set by the Max processes property in the datasource configuration. If you start profiling more tables, some of them appear as Queued. They are moved to the Profiling state when the previous tables have been profiled.
If the profiling process fails for a table, for example, if the table was dropped from the database, then the status is Failed. A tooltip on the failed status icon shows the error message.
The Profile Status reflects the current state of profiling for each table in the datasource. An empty value reflects a table for which there is no profiling data. Other statuses are:
- Ready: The latest profiling process ran successfully. Fresh profiling data is available for this table.
- Profiling: A profiling process is running for the table.
- Queued: A profiling process is queued for the table.
- Failed: The latest profiling process failed. Previous profiling data may be available.
- Canceled: The latest profiling process was stopped by the user.
Browsing the Profiles
Using xDM Discovery Built-in Dashboards
xDM Discovery includes built-in dashboards to visualize the profiling data.
Datasource Profile
The Datasource Profile dashboard shows the profiling information at the datasource level. It is available by clicking:
- a datasource in the navigation drawer’s Shortcuts,
- a datasource in the Datasources list,
- the View profile button in the datasource editor.
This dashboard contains the following tabs:
- Profile shows the profiled tables status with their record count.
- Table shows the columns of a given table, including lowest, highest, most frequent values, etc. Use the charts on this dashboard to infer the correct sizing for a column, the value range and mandatory nature the columns within a table.
- Columns shows the detailed column metrics for a given table, including lowest, highest, most frequent values and other value metrics. It also exposes value and pattern distribution for the column. Use this dashboard to decide whether a column should be constrained to a list of values, to infer the nature (Phone Number, SSID, etc.), as well as the rules and format to apply to the column.
Table Profile
The Table Profile dashboard shows the metrics from the Table and Column tabs, focusing on a single table of the datasource. It appears when clicking the View profile link for a table from the dashboard editor.
Customizing Applications in xDM Dashboards
xDM Discovery lets you seed an application in xDM Dashboards that contains the built-in dashboards. You can customize this application in Dashboard Builder and create visualizations specifically for your use case.
To seed a profiling application:
- Select Datasources in the navigation drawer.
The Datasources list opens. - From the list, select the datasources that you want to include in your dashboard application.
- In the More menu, select Seed profiling app.
- Provide a Title and a Name for this application.
- Click Create
The application is generated and opens in Dashboard Builder.
The generated application contains multiple charts and dashboard, some of which are not used in the application. Feel free to use and customize every component of this application.
Using Your BI Tools with xDM Discovery
You can use your own BI Tool to connect and perform analysis on xDM Discovery profiles:
- The Semarchy xDM repository, which stores the profile data, is hosted in a relational database accessible via JDBC, ODBC and other connectivity methods.
- Appendix A contains a reference to help you build dashboards based on the tables storing the profile data.
- The built-in dashboards may be used as starting points and examples to build your own visualizations.
Appendices
Appendix A: xDM Discovery Table Reference
The Semarchy xDM repository stores the profiling metrics in the tables listed in this appendix. Use this reference information to customize the existing charts or build your own visualizations on top of this data.
PRF_PROFILING
This table contains information about the profiling status of the tables.
Column Name | Description | |
---|---|---|
| text | Name of the datasource. |
| text | Name of the table. |
| text | Profiling status of the table. Refer to the profiling process section for more information.
|
| number | Progress of a running profiling process (1..100) |
| text | Error message when the profiling process has failed. |
| datetime | Timestamp when the profiling process was queued. |
| datetime | Timestamp when the profiling process was started. |
| datetime | Timestamp when the profiling process was finished. |
PRF_TABLE
This table contains profiling metrics for the tables.
Column Name | Description | |
---|---|---|
| text | Name of the datasource. |
| text | Name of the table. |
| number | Number of rows in the tables |
| datetime | Timestamp when the profile was updated. |
PRF_COLUMN
This table contains profiling metrics for the columns.
Column Name | Description | |
---|---|---|
| text | Name of the datasource. |
| text | Name of the table. |
| text | Name of the profiled column. |
| number | Ordinal position of the column in the table. |
| text | Datatype of the column. |
| number | Number of rows in the table. |
| text | Lowest value found in the column for all records. |
| text | Highest value found in the column for all records. |
| text | Average value found for the column for all records. |
| text | Most frequent value found for the column for all records. |
| number | Minimum value length found for the column for all records. |
| number | Maximum value length found for the column for all records. |
| number | Average value length found for the column for all records. |
| number | Number of null value occurrences found in the column for all records. |
| number | Number of distinct values (different values) found in the column for all records. |
| number | Number of unique values (values that appear once) found in the column for all records. |
| number | Number of records with values unique for the column. |
| number | Number of non-unique values (values that appear two or more times) found in the column for all records. |
| number | Number of records with values non-unique for the column. |
| text | Not loaded yet. Reserved for future use. |
Note that not all metrics are available for all datatypes. The following table lists the metrics available depending on the column datatype.
Metric | Applicable datatype |
---|---|
| All |
| text, numeric, date & datetime |
| numeric, date & datetime |
| text, numeric, date & datetime, boolean |
| text, longtext, binary |
| text, numeric, date & datetime, boolean |
| Not available yet. |
Value distribution | text, numeric, date & datetime |
Pattern distribution | text |
PRF_DIST_VALS
This table contains the list of distinct values in the profiled columns.
Column Name | Description | |
---|---|---|
| text | Name of the datasource. |
| text | Name of the table. |
| text | Name of the profiled column. |
| text | Column value. |
| number | Number of occurrences of the value found in the table. |
PRF_DIST_PATTERNS
This table contains the list of distinct patterns found in the values of the profiled columns.
Column Name | Description | |
---|---|---|
| text | Name of the datasource. |
| text | Name of the table. |
| text | Name of the profiled column. |
| text | Pattern found in the column values. |
| number | Number of occurrences of the pattern in the column. |
| text | Highest value matching this pattern in the column. |
| text | Lowest value matching this pattern in the column. |