Configure platform datasources

This page provides details to configure datasources for Semarchy xDM.

A platform datasource defines a connection to a database schema used by Semarchy xDM features. Such a connection uses Java Database Connectivity (JDBC).

Each datasource has an associate connection pool to provide Semarchy xDM features with ready-to-use connections.

Overview

Administrators configure platform datasources in the xDM Configuration user interface for usage by specific platform features, such as data locations, dashboards, xDM Discovery, variable value providers, plug-ins, etc.

Specific connections

Certain database connections performed by Semarchy xDM are not configured as platform datasources:

Datasource usages

A datasource is configured with specific database credentials. As such, it has access to a certain number of resources and operations in the database.

For each datasource, you can configure a set of usages. Usages restrict the features of Semarchy xDM that can use the datasource.

Datasource Usage Example

A CustomerDataHub datasource is configured to deploy a data location and has the data location usage. The required database privileges for a data location include DDL and CRUD operations on all tables in the schema.

Such privileges are excessive for dashboarding. As a consequence, this datasource should not have the dashboard usage.

A second CustomerDataDashboards datasource should be created, using different credentials and fewer privileges (e.g., SELECT on a subset of the tables). This second datasource would have the dashboard usage.

Datasources defined for usage with xDM Dashboard or xDM Discovery should preferably authenticate with users having limited privileges. Typically, they should only have read access, and possibly only to a subset of the tables and views.

Connection pools

Each datasource has a built-in connection pool. The connection pool maintains a set of ready-to-use connections. Features using a datasource request available connections from this pool as needed, and release these connections into the pool when done.

The pool creates new connections or removes idle connections automatically, based on the Minimum idle and Maximum Pool Size parameters, as well as the advanced properties.

Configure a datasource

A JDBC driver is required to connect a database. Drivers are provided by the database vendors as JAR files, that must be installed in the application server. For example, in the lib/ sub-folder of the Tomcat installation. For more details, see the application-server-specific installation instructions.

Datasource configuration is a platform-level administrative task performed in Semarchy Configuration.

To create a datasource:

  1. In Configuration, from the navigation drawer, select Datasources.
    The list of datasources opens.

  2. Click the Add datasource Add Datasource floating action button.

  3. In the New Datasource dialog, enter the datasource name.

  4. Select the Database Type corresponding to your database engine technology. If your technology is not listed, select Other.

  5. Click Create.

  6. In the Datasource editor, set the following properties:

    • Driver: the fully-qualified Java class name of the JDBC driver (e.g., org.postgresql.Driver for a PostgreSQL database).

    • URL: the JDBC connection URL. This URL depends on the driver (e.g., a PostgreSQL database URL has the following format: jdbc:postgresql://<host>:<port>/<database>).

    • Username and Password: the credentials used to establish the connection.

  7. Configure the Minimum Idle and Maximum Pool Size to define the size of the connection pool.

  8. (Optional) Configure the advanced properties of the connection pool.

  9. (Optional) Configure the driver properties specific to the database technology. For the comprehensive list of properties, see the driver documentation.

  10. Select the Datasource Usages that you want to allow.

  11. Save the datasource, and then click the Test datasource Test button to test it.

Changes performed in the datasource editor immediately apply.

Advanced properties

The following table lists the advanced properties to configure datasources.

Connection Timeout

Maximum number of milliseconds to wait to establish a connection. The default value is set to 30,000 (30 seconds).

Idle Timeout

Amount of time in milliseconds allowed for a connection to sit in the pool before being retired. Set to 0 to disable idle connections retirement. The default value is set to 600,000 (10 minutes).

Keep Alive Time

Frequency of the attempts to keep idle connections alive, to prevent database/network timeout, possibly using a connection test. Set this property to 0 to disable that feature.

Max Lifetime

Maximum lifetime of a connection before it is retired from the pool. The default value is set to 180,0000 (30 minutes).

Connection Test Query

SQL query used to test or keep a connection alive. When unset, the Connection.isValid() JDBC method is used.

Minimum Idle

Minimum number of idle connections to maintain in the pool. The default value is set to 1.

Maximum Pool Size

Maximum number of idle and in-use connections in the pool. This property must be greater than the Minimum Idle property. The default value is set to 8.

Connection Initialization SQL

SQL query executed when creating a connection and adding it to the pool.

Validation Timeout

Maximum number of milliseconds to wait for a connection test to succeed. The default value is set to 5,000 (5 seconds)

Driver properties

When configuring datasources for certain technologies, specific driver properties must be set.

Note that these properties are automatically set when creating a new datasource.

Oracle

A datasource to connect an Oracle database must be configured with the oracle.jdbc.J2EE13Compliant property set to true, to make the Oracle driver behave in a Java EE-compliant manner.

When this option is not set, errors such as the following one will be raised in the application log.

com.semarchy.mdm.runtime.data.InvalidDataAccessResourceUsageException: java.lang.RuntimeException: Unexpected DB value... (Class oracle.sql.TIMESTAMP for logicalType TIMESTAMP)

PostgreSQL

The default fetch size of the PostgreSQL JDBC driver is zero. This means a select statement with no limit or pagination logic returns the entire record set to the application server and stores it in a buffer, possibly causing excessive memory consumption.

It is recommended to set the defaultRowFetchSize property to 2000 for PostgreSQL datasources, particularly for those used for data locations.

Configure replicas

When deploying Semarchy xDM in distinct environments, you can configure, for each datasource, a different configuration for each environment.

To configure environment-specific configurations for a datasource, you can create replicas for this datasource, each replica is identified by an environment ID.

Datasource replicas for development and production

In your original environment, you address a DataHub datasource with a given URL and specific credentials. In the various production environments (PROD_US_WEST, PROD_US_EAST, etc) you want to use different URLs and credentials.

You can configure that datasource by default for the original environment, and configure replicas for the various production environments

Create a replica

To create a replica:

  1. In the datasource editor, select the Replicas tab. The list of replicas opens.

  2. Click the Add replica Add Replica button.

  3. In the New Replica dialog, enter the Environment ID.
    Note that you cannot have two replicas with the same environment ID for a given datasource.

  4. The replica editor opens, seeded with the configuration of the base datasource.
    Modify this configuration to meet the environment requirements. The replica exposes the same properties as the base datasource.

  5. Save the datasource, and then click the Test datasource Test button to test the replica configuration.

Use replicas

When starting an Semarchy xDM instance, you can specify the environment into which the instance runs using the xdm.datasources.environmentid startup configuration property.
You can set this property:

  • Using an environment variable:

    export XDM_DATASOURCES_ENVIRONMENTID=PROD
  • Using a Java system property:

    -Dxdm.datasources.environmentid=PROD
  • In the startup configuration file:

    xdm.datasources.environmentid=PROD

When the instance starts:

  • With no environment ID set, each datasource uses its base configuration, and replicas are not used.

  • with an environment ID set, each datasource uses the configuration of the replica for this environment ID instead of the base datasource configuration. If no replica exists for the environment ID, the base datasource configuration is used instead.

Database read replicas

You can use datasource replicas to configure and run Semarchy xDM in environments with database read replicas—that is, read-only copies of writable database instances.

Database read replicas are supported:

  • In dashboard datasources.

  • In data location datasources, only for read-only operations such as querying hub data with the REST API.

    If a datasource replica points to a fully read-only schema, features that require writing data—​such as authoring operations and certain browsing features reliant on writing technical data in a data location—​will fail. This limitation also extends to exporting data and saving filters.

Configure datasources using the REST API

Endpoints are available on the Semarchy xDM REST API to consult and configure datasources.

For more details, see the REST API documentation.

Datasource tuning

When configuring datasources for Semarchy xDM—and more specifically in production environments—​you must take into account several considerations listed below.

Connection pool sizing

Datasources configured for Semarchy use connection pools, sized according to the expected usage.

In the normal course of operation of Semarchy:

  • Each simultaneous user session connected to and interacting with an MDM application makes one connection to the data location datasource and possibly one to the repository when working with steppers.

  • Each simultaneous integration job makes one connection to the data location datasource, plus one to the repository datasource.

  • Each chart rendered by xDM Dashboard makes one connection to the datasource containing the data when it renders.

  • Each profiling process running in xDM Discovery makes one connection to the datasource containing the data it profiles and one connection to the repository.

  • Each chart rendered on an xDM Discovery profiling chart makes one connection to the repository.

You must configure the pool size as a trade-off between two directions:

  • If the pool creates and keeps too many connections, it will be ready to serve any client immediately, but it will also overuse resources by proactively creating too many connections.

  • If the pool does not have enough connections available, then the pool may be exhausted at certain times, causing client requests to wait until a connection becomes available in the pool. In a nutshell, the more users try to access a small pool and the longer their queries, the more they will have to wait between each click.

The connection pool size for each datasource is configured using the Minimum Idle and Maximum Pool Size parameters.

Connection management

When Semarchy takes a connection from the connection pool, it assumes that this connection is valid. In certain situations, the connection may be invalid at the database side (e.g., for maintenance reasons). It is a good practice to configure the pool to test connections before serving them.

This capability is enabled using the Connection Test Query parameter, which provides a SQL query used to test or keep a connection alive. If this parameter is unset, the Connection.isValid() JDBC method is used to test the connection before serving it from the pool.