Spark Component Release Notes
This page lists the main features added to the Spark Component.
Feature Highlights
Version 2.0.5
Handle Spark session configuration when multiple targets are loaded with Spark
Spark always uses the same session within a single Spark program. In previous versions, once the session had been created, it could no longer be configured. As a consequence, all the session configurations must be known at the beginning of the execution.
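The constraint described above can be illustrated with a minimal sketch. It is not the Component's implementation: the function name `merge_session_configs` and the two example targets are hypothetical, and the sketch only shows the idea of gathering every target's settings into one configuration before any session exists.

```python
from typing import Dict, Iterable, Mapping


def merge_session_configs(target_configs: Iterable[Mapping[str, str]]) -> Dict[str, str]:
    """Combine the Spark settings requested by every target into one dict.

    Raises ValueError when two targets ask for different values of the same
    key, because a single shared session can hold only one value per key.
    """
    merged: Dict[str, str] = {}
    for config in target_configs:
        for key, value in config.items():
            if key in merged and merged[key] != value:
                raise ValueError(
                    f"conflicting values for {key!r}: {merged[key]!r} vs {value!r}"
                )
            merged[key] = value
    return merged


# Hypothetical settings for two targets loaded by the same Spark program.
hive_target = {"spark.sql.shuffle.partitions": "200"}
es_target = {"es.nodes": "localhost", "es.port": "9200"}

session_config = merge_session_configs([hive_target, es_target])
print(session_config)
```

With PySpark, each key/value pair of such a merged dictionary would typically be passed to `SparkSession.builder.config(key, value)` before calling `getOrCreate()`, so that the single session starts with every setting already in place.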
The Spark Component has been updated to handle Spark session configuration when multiple targets are loaded with Spark.
Spark Submit TOOL
When submitting a Spark job, it is possible to specify one or more resource files that will be made available to each Spark worker.
The Spark Submit TOOL has been updated to support resource files through a new dedicated node in Metadata and a new parameter on the Spark Submit TOOL.
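For readers unfamiliar with how resource files reach the workers, the following sketch builds a `spark-submit` command line using the standard `--files` and `--deploy-mode` options. It does not describe how the Spark Submit TOOL itself assembles the command: the helper name `build_spark_submit_args` and the file names are hypothetical, and the deploy mode is included only because this version's change log lists it as a separate addition.

```python
from typing import List, Sequence


def build_spark_submit_args(
    application: str,
    resource_files: Sequence[str] = (),
    deploy_mode: str = "client",
) -> List[str]:
    """Return the argument list for a spark-submit invocation."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError(f"unsupported deploy mode: {deploy_mode!r}")
    args = ["spark-submit", "--deploy-mode", deploy_mode]
    if resource_files:
        # spark-submit expects one comma-separated list after --files.
        args += ["--files", ",".join(resource_files)]
    # The application jar or script comes after the options.
    args.append(application)
    return args


# Hypothetical job and resource files.
command = build_spark_submit_args(
    "job.jar",
    resource_files=["lookup.csv", "log4j.properties"],
    deploy_mode="cluster",
)
print(" ".join(command))
```

Returning a list rather than a single string lets the result be handed directly to `subprocess.run` without going through a shell, which avoids quoting differences between operating systems.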
Version 2.0.2
Change Data Capture (CDC)
Multiple improvements have been made to homogenize the usage of Change Data Capture (CDC) across the various Components.
Parameters have been homogenized so that all Templates now have the same CDC Parameters and support the same features.
Multiple fixes have also been made to correct CDC issues. Refer to the changelog for the exact list of changes.
Version 2.0.1
New Templates to load data from and into Elasticsearch
Two new dedicated Templates have been added to load data from Elasticsearch into Spark, and to load data from Spark into Elasticsearch.
New Templates to load data from and into Parquet HDFS files
Two new dedicated Templates have been added to load data from Parquet HDFS Files into Spark, and to load data from Spark into HDFS Parquet Files.
Improve datatype conversion
Datatype conversion between various systems when working with Spark has been improved to better handle the different datatypes.
Fix Kerberos command under Windows environment
An issue with the Kerberos command launched in Windows environments has been fixed.
The "kinit" command launched to initialize Kerberos security was not formed properly for Windows environments.
Fix issue with partition truncation when loading data to Hive using SCD mode
When loading data from Spark into Hive through SCD mode, there was an issue when changes to the data led to a partition truncation.
In this situation, Template execution failed when trying to get the partitions to truncate.
This issue has been fixed.
Change Log
Version 2023.1.0
Version 5.3.8 (Component Pack)
New Features
- DI-5872: The decimal precision defined is unexpectedly ignored.
- DI-6088: The LOAD Hdfs XML to Spark template is now available.
- DI-6397: Spark 1.6 templates have been removed.
- DI-6571: The LOAD Hdfs XML to Spark template has been updated to ensure that the datastore and column names are not truncated.
Version 2.0.5 (Spark Component)
New Features
- DI-4011: Addition of ability to specify deploy mode (cluster or client)
- DI-4012: Addition of ability to work with resource files through a new dedicated node in Metadata and a new parameter on the Spark Submit TOOL
- DI-4042: Addition of ability to handle Spark session configuration when multiple targets are loaded with Spark
Version 2.0.4 (Spark Component)
New Features
- DI-3580: Template - LOAD Hdfs File to Spark - new parameter "In File List"
- DI-3581: Template - LOAD Hdfs File to Spark - support compressed files when using fileDriver Read Method
- DI-3719: Template - Load Hdfs Json to Spark - new Template to load JSON files stored in HDFS into Spark
- DI-3800: TOOL - Spark Execution Unit Launcher - add number of partitions to debug prints