Spark Component Release Notes

This page lists the main features added to the Spark Component.

Feature Highlights

Version 2.1.0

This version contains bug fixes, which can be found in the complete changelog.

Version 2.0.5

Handle Spark session configuration when multiple targets are loaded with Spark

Spark always uses the same session within a single Spark program. In previous versions, once the session had been created, it could no longer be configured. As a consequence, all the session configurations must be known at the beginning of the execution.
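
As a rough illustration of why the configurations must be known up front, the sketch below gathers the settings requested by each target and merges them into one dictionary before any session would be created. The function name, the example keys and values, and the choice to raise an error on conflicting values are assumptions made for illustration; these notes do not describe how the Component resolves conflicts.

```python
from typing import Dict, Iterable


def merge_session_configs(target_configs: Iterable[Dict[str, str]]) -> Dict[str, str]:
    """Merge per-target Spark settings into one dictionary, gathered before the session exists."""
    merged: Dict[str, str] = {}
    for config in target_configs:
        for key, value in config.items():
            # Two targets asking for different values of the same key cannot
            # both be satisfied by a single shared session (assumed policy: fail).
            if key in merged and merged[key] != value:
                raise ValueError(
                    f"Conflicting values for {key!r}: {merged[key]!r} vs {value!r}"
                )
            merged[key] = value
    return merged


# Hypothetical settings requested by two targets of the same Spark program.
elasticsearch_target = {"spark.es.nodes": "es-host:9200"}
hive_target = {"spark.sql.shuffle.partitions": "64"}

session_config = merge_session_configs([elasticsearch_target, hive_target])
```

With PySpark, a merged dictionary like this could then be applied through SparkSession.builder.config(key, value) calls before getOrCreate() is invoked.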

The Spark Component has been updated to handle Spark session configuration when multiple targets are loaded with Spark.

Spark Submit TOOL

When submitting a Spark job, it is possible to specify one or more resource files that will be made available to each Spark worker.

The Spark Submit TOOL has been updated to support resource files through a new dedicated node in Metadata and a new parameter on the TOOL.
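
For readers less familiar with spark-submit, resource files are passed on the command line through the --files option, which takes a single comma-separated list. The sketch below shows one way such an argument could be assembled; the function name and the file paths are hypothetical, and this is not a description of how the Spark Submit TOOL builds its command internally.

```python
from typing import List, Sequence


def files_option(resource_files: Sequence[str]) -> List[str]:
    """Return the spark-submit arguments that ship the given resource files."""
    if not resource_files:
        # Nothing to ship: leave the option out entirely.
        return []
    # spark-submit expects one comma-separated list after --files.
    return ["--files", ",".join(resource_files)]


# Hypothetical resource files.
args = files_option(["/opt/conf/lookup.csv", "/opt/conf/log4j.properties"])
```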

Ability to specify deploy mode (cluster or client)

A new field has been added to the Spark Metadata to specify the deploy mode (cluster or client) of the Spark program when using spark-submit.
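
On the spark-submit command line, the deploy mode corresponds to the --deploy-mode option, which accepts client or cluster. The following sketch validates a value and places it in a command; the function name, the yarn master, and the my_job.py script are assumptions chosen for illustration only.

```python
from typing import List

VALID_DEPLOY_MODES = ("client", "cluster")


def deploy_mode_option(deploy_mode: str) -> List[str]:
    """Return the spark-submit arguments for the requested deploy mode."""
    mode = deploy_mode.strip().lower()
    if mode not in VALID_DEPLOY_MODES:
        raise ValueError(
            f"Unsupported deploy mode {deploy_mode!r}; expected one of {VALID_DEPLOY_MODES}"
        )
    return ["--deploy-mode", mode]


# Hypothetical command: the master and the application script are placeholders.
command = ["spark-submit", "--master", "yarn", *deploy_mode_option("cluster"), "my_job.py"]
```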

Version 2.0.4

Minor improvements and fixed issues

This version contains some minor improvements and fixed issues, which can be found in the complete changelog.

Version 2.0.3

Minor improvements and fixed issues

This version contains some minor improvements and fixed issues, which can be found in the complete changelog.

Version 2.0.2

Change Data Capture (CDC)

Multiple improvements have been made to harmonize the usage of Change Data Capture (CDC) across the various Components.

Parameters have been harmonized, so that all Templates should now have the same CDC Parameters and support the same features.

Multiple fixes have also been made to correct CDC issues. Refer to the changelog for the exact list of changes.

Version 2.0.1

New Templates to load data from and into Elasticsearch

Two new dedicated Templates have been added to load data from Elasticsearch into Spark, and to load data from Spark into Elasticsearch.

New Templates to load data from and into Parquet HDFS files

Two new dedicated Templates have been added to load data from Parquet HDFS files into Spark, and to load data from Spark into Parquet HDFS files.

Improve datatype conversion

Datatype conversion between various systems when working with Spark has been improved to better handle the different datatypes.

Fix Kerberos command in Windows environments

An issue with the Kerberos command launched in Windows environments has been fixed.

The "kinit" command launched to initialize Kerberos security was not formed properly for Windows environments.

Fix issue with partition truncation when loading data to Hive using SCD mode

When loading data from Spark into Hive in SCD mode, an issue occurred when changes to the data led to a partition truncation.

In this situation, Template execution failed when trying to get the partitions to truncate.

This issue has been fixed.

Change Log

Version 5.3.8 (Component Pack)

New Features

  • DI-5872: The decimal precision defined is unexpectedly ignored.

Version 3.0.0 (Component Pack)

New Features

  • DI-4508: Update Components and Designer to take into account dedicated license permissions

  • DI-4727: Rebranding: Templates and sample projects

  • DI-4731: Rebranding: Template messages

  • DI-4962: Improved component dependencies and requirements management

Version 2.1.0 (Spark Component)

Bug Fixes

  • DI-3028: Mappings - Spark Templates were unexpectedly proposed in some situations in Mappings even when Spark was not involved

Version 2.0.5 (Spark Component)

New Features

  • DI-4011: Addition of ability to specify deploy mode (cluster or client)

  • DI-4012: Addition of ability to work with resource files through a new dedicated node in Metadata and a new parameter on the Spark Submit TOOL

  • DI-4042: Addition of ability to handle spark session configuration when multiple targets are loaded with Spark

Bug Fixes

  • DI-4037: Template LOAD Hdfs File to Spark - The error message has been updated to be clearer (instead of NullPointerException) when an error occurs while reading a file

Version 2.0.4 (Spark Component)

New Features

  • DI-3580: Template - LOAD Hdfs File to Spark - new parameter "In File List"

  • DI-3581: Template - LOAD Hdfs File to Spark - support compressed files when using fileDriver Read Method

  • DI-3719: Template - Load Hdfs Json to Spark - new Template to load JSON files stored in HDFS into Spark

  • DI-3800: TOOL - Spark Execution Unit Launcher - add number of partitions to debug prints

Bug Fixes

  • DI-3579: Templates - use createOrReplaceTempView instead of registerTempTable which is deprecated

Version 2.0.3 (Spark Component)

New Features

  • DI-2002: Spark - add support for HTTPS and Kerberized Livy connections

  • DI-2539: Spark - enhance datatype conversion when reading through JDBC

Bug Fixes

  • DI-2540: Spark - when a Spark session executed by Livy fails, the error cause is unexpectedly not returned

Version 2.0.2 (Spark Component)

New Features

  • DI-1912: Templates updated - support having CDC sources on Templates which were not supporting it (such as staging templates)

  • DI-1909: Templates updated - New Parameters 'Unlock Cdc Table' and 'Lock Cdc Table' to configure the behaviour of CDC tables locking

Bug Fixes

  • DI-1907: Templates updated - The 'Cdc Subscriber' parameter was ignored in some Templates when querying the source data