Configure the database schemas

This page explains how to configure databases and schemas for the Semarchy xDM repository and data locations.

For Semarchy xDM to install and boot properly, the following four elements must be set on your database:

  • A repository storage—that is, internal storage used by Semarchy xDM to store the design-time and run-time information (recommended name: semarchy_repository)

  • A repository user—that is, a user to log in to the repository storage with write privileges (recommended name: semarchy_repository).

  • A repository read-only user—that is, a user to log in to the repository storage with read-only privileges (recommended name: semarchy_repository_readonly). It is used by Semarchy xDM Discovery’s built-in profile dashboards.

  • At least one data location storage that includes the database tables and other objects generated from the model edition of your application (e.g., customer_mdm, product_data_location).

Configure the repository storage

Repository storage

Before installing Semarchy xDM, you must create storage for the repository. This can be done either manually or using your database administration interface. Below are sample scripts for creating storage. Make sure to adapt the script according to your database configuration.

Create the repository schema (Oracle)
CREATE USER <repository_user> IDENTIFIED BY <repository_password>
 DEFAULT TABLESPACE USERS TEMPORARY TABLESPACE TEMP;
GRANT CONNECT, RESOURCE TO <repository_user>;

-- Use the following command for Oracle 12c and above
GRANT UNLIMITED TABLESPACE TO <repository_user>;
Create a database and the repository schema (PostgreSQL)
-- Create a database for the repository and data locations
CREATE DATABASE <postgresql_database_name> WITH ENCODING 'UTF8';

-- Disconnect and then reconnect using:
--   the JDBC URL:  jdbc:postgresql://<host>:<port>/<postgresql_database_name>
--   or using psql with the following command: psql -U postgres <postgresql_database_name>

CREATE SCHEMA extensions;
GRANT USAGE ON SCHEMA extensions TO PUBLIC;
ALTER DEFAULT PRIVILEGES IN SCHEMA extensions GRANT EXECUTE ON FUNCTIONS TO PUBLIC;
ALTER DATABASE <postgresql_database_name> SET SEARCH_PATH TO "$user",public,extensions;

CREATE EXTENSION IF NOT EXISTS "uuid-ossp"     with schema extensions;
CREATE EXTENSION IF NOT EXISTS "fuzzystrmatch" with schema extensions;

-- Create the repository user and schema
CREATE USER <repository_user> WITH PASSWORD '<repository_password>';

-- Use the following syntax for PostgreSQL 9
-- CREATE USER <repository_user> WITH UNENCRYPTED PASSWORD '<repository_password>';

-- The following command is required only for PostgreSQL running on Amazon RDS.
--   It grants access to the repository to the RDS superuser.
-- GRANT <repository_user> TO <rds_superuser_name>

CREATE SCHEMA <repository_user> AUTHORIZATION <repository_user>;
Create a repository database, login, and user (SQL Server)
-- Create a database for the repository
CREATE DATABASE <repository_database_name>
GO

-- Configure the database
ALTER DATABASE <repository_database_name> SET READ_COMMITTED_SNAPSHOT ON;
GO

ALTER DATABASE <repository_database_name> SET QUOTED_IDENTIFIER ON;
GO

-- Create a login to connect the database
CREATE LOGIN <repository_user> WITH PASSWORD='<repository_password>', DEFAULT_DATABASE=<repository_database_name>
GO

-- Add a user for that login in the database
USE <repository_database_name>
GO

CREATE USER <repository_user> FOR LOGIN <repository_user>
GO

-- Make this user database owner
ALTER ROLE db_owner ADD MEMBER <repository_user>
GO
Store the values of the <repository_user>, <repository_password>, and <repository_database_name> (for SQL Server), as you will need them later for the startup configuration.

Repository read-only user

In addition to the repository storage and credentials with full access to the repository schema, a user with limited access to the repository must be created.

This user is used by xDM Discovery’s built-in profile dashboards and should have at least read-only access to the tables in the repository schema that stores the profiling metrics. These tables have the PRF_ prefix.

Since the repository tables are not created yet, you must run the GRANT statements for this user on the repository tables after creating the repository. These grants are also provided in the post-installation steps.
Create the repository read-only user (Oracle)
-- Create the repository read-only user
CREATE USER <repository_readonly_user>
 IDENTIFIED BY <repository_readonly_password>
 DEFAULT TABLESPACE USERS TEMPORARY TABLESPACE TEMP;

-- Grant minimum access to the user
GRANT CONNECT TO <repository_readonly_user>;

-- Run the following commands after the repository creation.
-- Grant select privileges on the profiling tables
GRANT SELECT ON <repository_user>.PRF_PROFILING TO <repository_readonly_user>;
GRANT SELECT ON <repository_user>.PRF_TABLE TO <repository_readonly_user>;
GRANT SELECT ON <repository_user>.PRF_COLUMN TO <repository_readonly_user>;
GRANT SELECT ON <repository_user>.PRF_DIST_VALS TO <repository_readonly_user>;
GRANT SELECT ON <repository_user>.PRF_DIST_PATTERNS TO <repository_readonly_user>;

-- The <repository_readonly_user> user must use the following statement to default their schema to the repository.
-- ALTER SESSION SET CURRENT_SCHEMA = <repository_user>;
Create the repository read-only user (PostgreSQL)
-- Create the repository read-only user
CREATE USER <repository_readonly_user> WITH PASSWORD '<repository_readonly_password>';

-- Use the following syntax for PostgreSQL 9
-- CREATE USER <repository_readonly_user> WITH UNENCRYPTED PASSWORD '<repository_readonly_password>';

GRANT CONNECT ON DATABASE <postgresql_database_name> to <repository_readonly_user>;

-- Set the search path to include the repository
ALTER ROLE <repository_readonly_user> SET SEARCH_PATH TO "$user", <repository_user>,public,extensions;

-- Run the following commands after the repository creation
-- Grant select privileges on the profiling tables
GRANT USAGE ON SCHEMA <repository_user> TO <repository_readonly_user>;
GRANT SELECT ON TABLE
    <repository_user>.PRF_PROFILING,
    <repository_user>.PRF_TABLE,
    <repository_user>.PRF_COLUMN,
    <repository_user>.PRF_DIST_VALS,
    <repository_user>.PRF_DIST_PATTERNS
TO <repository_readonly_user>;
Create the repository read-only user (SQL Server)
-- Create the repository read-only user
CREATE LOGIN <repository_readonly_user> WITH PASSWORD='<repository_readonly_password>', DEFAULT_DATABASE=<repository_database_name>
GO

-- Add a user for that login in the database
USE <repository_database_name>
GO

CREATE USER <repository_readonly_user> FOR LOGIN <repository_readonly_user>;
GO

-- Grant minimum access to the user
GRANT CONNECT TO <repository_readonly_user>;

-- Run the following commands after the repository creation
-- Grant select privileges on the profiling tables
GRANT SELECT ON PRF_PROFILING TO <repository_readonly_user>;
GRANT SELECT ON PRF_TABLE TO <repository_readonly_user>;
GRANT SELECT ON PRF_COLUMN TO <repository_readonly_user>;
GRANT SELECT ON PRF_DIST_VALS TO <repository_readonly_user>;
GRANT SELECT ON PRF_DIST_PATTERNS TO <repository_readonly_user>;
Store the values of the <repository_readonly_user> and <repository_readonly_password>, as you will need them later for the startup configuration.

Configure the data locations storage

Data locations' databases or schemas do not necessarily have to be created at the time of installation, but it is recommended to anticipate their creation as part of the installation and configuration efforts. You can create them manually or use your database administration interface for this purpose.
Below are sample scripts for creating a data location database or schema. Make sure to adapt this script to your database configuration and duplicate it to create storage for all data locations.

Create a data location schema (Oracle)
CREATE USER <data_location_user_name> IDENTIFIED BY <data_location_user_password>
 DEFAULT TABLESPACE USERS TEMPORARY TABLESPACE TEMP;
GRANT CONNECT, RESOURCE, CREATE VIEW TO <data_location_user_name>;

-- Use the following command for Oracle 12c and above
GRANT UNLIMITED TABLESPACE TO <data_location_user_name>;

-- Grant data-loading privileges to the data location owner
GRANT EXECUTE ON <repository_database_schema>.INTEGRATION_LOAD TO <data_location_user_name>;
The CREATE VIEW privilege is required to deploy database views.
Create a data location schema (PostgreSQL)
CREATE USER <data_location_user_name> WITH PASSWORD '<data_location_user_password>';

-- Use the following syntax for PostgreSQL 9
-- CREATE USER <data_location_user_name> WITH UNENCRYPTED PASSWORD '<data_location_user_password>';

-- The following command is required only for PostgreSQL running on Azure or Amazon RDS.
--   It grants access to the data location to the Azure or RDS superuser.
-- GRANT <data_location_user_name> TO <azure|rds_superuser_name>

CREATE SCHEMA <data_location_user_name> AUTHORIZATION <data_location_user_name>;

-- Grant data-loading privileges to the data location owner
GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA <repository_schema> TO <data_location_user_role>;
Create a data location database, login, and user (SQL Server)
-- Create a database for the data location
CREATE DATABASE <data_location_database_name>
GO

-- Configure the database
ALTER DATABASE <data_location_database_name> SET READ_COMMITTED_SNAPSHOT ON;
GO

ALTER DATABASE <data_location_database_name> SET QUOTED_IDENTIFIER ON;
GO

-- Create a login to connect the database
CREATE LOGIN <data_location_user_name> WITH PASSWORD='<data_location_user_password>', DEFAULT_DATABASE=<data_location_database_name>
GO

-- Add a user for that login in the database
USE <data_location_database_name>
GO

CREATE USER <data_location_user_name> FOR LOGIN <data_location_user_name>
GO

-- Make this user database owner
ALTER ROLE db_owner ADD MEMBER <data_location_user_name>
GO

-- Grant data-loading privileges to the data location owner
USE <repository_database_name>;
GRANT EXECUTE ON SCHEMA :: dbo TO <data_location_user_name>;
Store the values of the <data_location_user_name>, <data_location_user_password>, and <data_location_database> (for SQL Server), as you will need them later for creating the data location datasource.

Database-specific considerations

General considerations

Configure the database with a charset that supports all languages, such as AL32UTF8 for Oracle and UTF8 for PostgreSQL. Semarchy xDM uses specific characters for storing internal information. Applications in Semarchy xDM natively support multi-lingual data without preventing users from entering accented characters (or Cyrillic, Arabic, or Chinese characters). A database configured for a specific language or with a limited charset may not function optimally with Semarchy xDM.

Oracle

Repositories and data locations should be located in separate schemas. However, they do not necessarily need to be located in the same database.

Semarchy xDM ships with an Oracle JDBC driver (ojdbc8.jar) for Oracle Database 12c Release 2 (12.2.x). This driver is strongly recommended for all recent database versions. If you are using an older Oracle version (11g), it is recommended to review the compatibility of this driver with your Oracle database version and possibly install an older driver version instead (ojdbc6 or ojdbc7).

To increase the maximum string length from 4,000 to 32,767 bytes on an Oracle database, platform administrators can set the MAX_STRING_SIZE parameter to EXTENDED in the Oracle database settings, and add com.semarchy.mdm.supportExtendedMaxStringSize to xDM system properties.

Once set to EXTENDED, the MAX_STRING_SIZE parameter cannot be set back to STANDARD. Likewise, the com.semarchy.mdm.supportExtendedMaxStringSize property cannot be reverted to false once it has been set to true.

PostgreSQL

Repositories and data locations should be located in separate schemas.

SQL Server

The configuration presented above uses logins defined in the database. Modify this script to use Windows or AD logins as needed.

Semarchy xDM does not support schemas for instances of SQL Server. A separate database is necessary for each repository and data location, and each user logging into these databases should have the database owner role.

For the following reasons, exercise caution when configuring the database instance’s collation:

  • Collation determines code page, case sensitivity (CS/CI), and accent sensitivity (AS/AI). It has a significant impact on comparison functions, order by clauses, and execution performance. During comparisons, any character falling outside the defined collation code page is considered an "unknown character". Since two different unknown characters are consistently treated as different from each other, this may cause unexpected complications in comparisons.

  • To ensure support for all characters you intend to use or store in the hub, the collation for the repository’s hosting database should be selected judiciously. Semarchy xDM internally utilizes the following special characters (identified by their Unicode number): £ (U+00A3), $ (U+0024), ¤ (U+00A4), • (U+2022). These characters should be also supported by the collation.

Consider using a UTF-8 collation if supported by your SQL Server version.
  • It is recommended to select a case-sensitive collation for the repository and data locations database. Using a case-insensitive collation may cause unexpected results in search operations.

the SEM_NORMALIZE function used is made collation-proof and forces the Latin1_General_100_CS_AS_KS_WS_SC collation.

Additionally, SQL Server repository and data location databases should be configured as follows for Semarchy xDM:

  • QUOTED_IDENTIFIER should be set to ON to force SQL Server to follow the ISO rules for identifiers and literal values quoting, using the following command:

    ALTER DATABASE <database_name> SET QUOTED_IDENTIFIER ON;
  • READ_COMMITTED_SNAPSHOT should be set to ON to allow connections to access the previous (committed) version of the records being modified instead of waiting for them to be unlocked, using the following command:

    ALTER DATABASE <database_name> READ_COMMITTED_SNAPSHOT ON;

Size and maintain the databases and schemas

Repository

The following considerations should be taken into account when sizing repository databases and schemas:

  • The repository’s overall space is divided among various components, including model-related data—​whose volume is relatively modest and depends on the model size; the execution log—​which includes data on jobs, job instances, and steppers; and data from the deployment history and model editions, encompassing even closed model editions.

  • Determining the storage size for a repository involves considering factors such as the number of batches processed, the volume of data generated during model executions, and the dynamic nature of development activities. As a standard recommendation, we suggest allocating a minimum of 20 GB of disk space for the repository schema. While hosting models, applications, and dashboard metadata may not require that much space initially, it allows for proper monitoring setup and potential future expansion. Moreover, real-world observations show that, over time, repositories often exceed this size significantly, especially with continuous development activities. Therefore, it is advisable to consider a more substantial storage allocation to accommodate potential growth.

  • To maintain the execution log at a manageable volume, it is recommended to perform regular purges. Configuring sensible retention policies within models and scheduling periodic data purges in data locations helps control the repository’s overall size.

Data location

The following considerations should be taken into account when sizing the data location databases/schemas:

  • For each entity of the deployed model, the data location contains several tables that correspond to the data in the various stages of the certification process. Some of these tables also contain the history of the previous iterations of the process.

  • The volume required in the data location depends on the following factors:

    • The number of entities.

    • The number of source records pushed for these entities.

    • The number of new or updated source records pushed for these entities.

A recommended original sizing is to add the source data volume pushed for each entity by all publishers plus one data authoring (the overall input) and multiply it by a factor of 10. After the original sizing, it is recommended to monitor the size of the data location in the normal course of operations and adjust the sizing appropriately.

The same sizing considerations apply to both the data and temporary tablespaces or filegroups in the case of the data locations, as the database engine is used for most of the processing effort in the certification process.

Data retention policies can be created to define the volume of data to retain in the data locations, and data purges can be scheduled to trigger the pruning of unnecessary data.