Spark Component Release Notes
Handle Spark session configuration when multiple targets are loaded with Spark
Spark always uses the same session within a single Spark program. In previous versions, once the session had been created, it could no longer be configured. As a consequence, all the session configurations must be known at the beginning of the execution.
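The constraint above can be illustrated with a minimal sketch: the settings requested by each target are collected and merged before the single session is created. The function name merge_session_configs and the two example targets are hypothetical and are not part of the Component; only the property names are standard Spark settings.

```python
from typing import Dict, Iterable


def merge_session_configs(target_configs: Iterable[Dict[str, str]]) -> Dict[str, str]:
    """Merge per-target Spark settings into one dict, rejecting conflicts."""
    merged: Dict[str, str] = {}
    for config in target_configs:
        for key, value in config.items():
            # Two targets asking for different values cannot share one session.
            if key in merged and merged[key] != value:
                raise ValueError(
                    f"Conflicting values for {key!r}: {merged[key]!r} vs {value!r}"
                )
            merged[key] = value
    return merged


# Hypothetical settings requested by two targets of the same Spark program.
first_target = {"spark.sql.shuffle.partitions": "200"}
second_target = {
    "spark.sql.shuffle.partitions": "200",
    "spark.sql.session.timeZone": "UTC",
}

session_config = merge_session_configs([first_target, second_target])

# With PySpark installed, the merged settings would be applied once, up front:
#   builder = SparkSession.builder
#   for key, value in session_config.items():
#       builder = builder.config(key, value)
#   spark = builder.getOrCreate()
```

Raising an error on conflicting values is a design choice of this sketch; the release notes do not describe how the Component resolves such conflicts.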
The Spark Component has been updated to handle Spark session configuration when multiple targets are loaded with Spark.
Spark Submit TOOL
When submitting a Spark job, it is possible to specify one or more resource files that will be made available to each Spark worker.
The Spark Submit TOOL now supports resource files through a new dedicated node in Metadata and a new parameter on the Spark Submit TOOL.
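As an illustration of what resource files mean at the Spark level, the sketch below builds a spark-submit command line using the standard --files and --deploy-mode options. Whether the Spark Submit TOOL generates exactly this command is an assumption; the function name build_spark_submit and the file names are hypothetical.

```python
from typing import List, Sequence


def build_spark_submit(
    application: str,
    deploy_mode: str = "client",
    resource_files: Sequence[str] = (),
) -> List[str]:
    """Build a spark-submit argument list (hypothetical helper)."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError(f"Unsupported deploy mode: {deploy_mode!r}")
    command = ["spark-submit", "--deploy-mode", deploy_mode]
    if resource_files:
        # spark-submit expects a single comma-separated list after --files.
        command += ["--files", ",".join(resource_files)]
    command.append(application)
    return command


# Hypothetical job and resource file names.
command = build_spark_submit(
    "job.py",
    deploy_mode="cluster",
    resource_files=["lookup.csv", "settings.json"],
)
print(" ".join(command))
```

Returning a list of arguments rather than a single string makes it possible to pass the command to subprocess.run without any shell quoting.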
Change Data Capture (CDC)
Several improvements have been made to homogenize the usage of Change Data Capture (CDC) across the various Components.
Parameters have been homogenized so that all Templates should now have the same CDC Parameters and support the same features.
Several CDC issues have also been fixed. Refer to the changelog for the exact list of changes.
New Templates to load data from and into Elasticsearch
Two new dedicated Templates have been added to load data from Elasticsearch into Spark, and to load data from Spark into Elasticsearch.
New Templates to load data from and into Parquet HDFS files
Two new dedicated Templates have been added to load data from Parquet HDFS files into Spark, and to load data from Spark into Parquet HDFS files.
Improve datatype conversion
Datatype conversion between various systems when working with Spark has been improved to better handle the different datatypes.
Fix Kerberos command under Windows environments
An issue with the Kerberos command launched under Windows environments has been fixed.
The "kinit" command launched to initialize Kerberos security was not formed properly for Windows environments.
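The release notes do not state how the command was malformed. As a general illustration only, one common way to keep a "kinit" invocation well formed on every platform is to build it as an argument list instead of a single string, so that paths containing spaces or backslashes need no manual quoting. The helper name build_kinit_command, the principal and the keytab path below are hypothetical.

```python
from typing import List


def build_kinit_command(principal: str, keytab: str, kinit_path: str = "kinit") -> List[str]:
    """Build a keytab-based kinit argument list (hypothetical helper)."""
    # -k requests a keytab-based login, -t names the keytab file to use.
    return [kinit_path, "-k", "-t", keytab, principal]


# Hypothetical principal and Windows keytab path containing a space.
command = build_kinit_command(
    "etl@EXAMPLE.COM",
    r"C:\Program Files\keytabs\etl.keytab",
)

# subprocess.run(command, check=True) would pass each element as a separate
# argument, leaving any platform-specific quoting to Python.
```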
Fix issue with partition truncation when loading data to Hive using SCD mode
When loading data from Spark into Hive through SCD mode, there was an issue when the changes on the data led to a partition truncation.
In this situation, the Template execution failed when trying to get the partitions to truncate.
This issue has been fixed.
Version 5.3.8 (Component Pack)
Version 3.0.0 (Component Pack)
Version 2.1.0 (Spark Component)
Version 2.0.5 (Spark Component)
DI-4011: Addition of ability to specify deploy mode (cluster or client)
DI-4012: Addition of ability to work with resource files through a new dedicated node in Metadata and a new parameter on the Spark Submit TOOL
DI-4042: Addition of ability to handle spark session configuration when multiple targets are loaded with Spark
Version 2.0.4 (Spark Component)
DI-3580: Template - LOAD Hdfs File to Spark - new parameter "In File List"
DI-3581: Template - LOAD Hdfs File to Spark - support compressed files when using fileDriver Read Method
DI-3719: Template - Load Hdfs Json to Spark - new Template to load JSON files stored in HDFS into Spark
DI-3800: TOOL - Spark Execution Unit Launcher - add number of partitions to debug prints
Version 2.0.3 (Spark Component)
DI-2002: Spark - add support for HTTPS and Kerberized Livy connections
DI-2539: Spark - enhance datatype conversion when reading through JDBC
Version 2.0.2 (Spark Component)
DI-1912: Templates updated - support CDC sources on Templates that previously did not support them (such as staging templates)
DI-1909: Templates updated - New Parameters 'Unlock Cdc Table' and 'Lock Cdc Table' to configure the behaviour of CDC tables locking