Spark Component Release Notes
Spark always uses the same session within a single Spark Program. In previous versions, once the session was created it could no longer be configured; as a consequence, all the various session configurations must be known at the beginning of the execution.
The Spark component has been updated to handle Spark session configuration when multiple targets are loaded with Spark.

Spark Submit TOOL
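Because the session cannot be reconfigured after creation, settings coming from several targets have to be collected and merged before the session is built. The sketch below illustrates that merge step; the function name and the conflict policy are hypothetical and not part of the Component:

```python
def merge_session_configs(*target_configs):
    """Merge per-target Spark session settings into a single dict.

    Raises ValueError on conflicting values, since a setting can no
    longer be changed once the session exists.
    """
    merged = {}
    for config in target_configs:
        for key, value in config.items():
            if key in merged and merged[key] != value:
                raise ValueError(
                    f"Conflicting values for '{key}': "
                    f"{merged[key]!r} vs {value!r}"
                )
            merged[key] = value
    return merged

# All settings are known before the session is created:
settings = merge_session_configs(
    {"spark.sql.shuffle.partitions": "200"},
    {"spark.executor.memory": "4g"},
)
```

The merged settings would then be applied once, for example through `SparkSession.builder.config(...)`, before any target runs.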
When submitting a Spark job, it is possible to specify one or more resource files that will be made available for each Spark worker.
The Spark Submit TOOL has been updated to support resource files through a new dedicated node in Metadata and a new parameter on the TOOL.
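At the spark-submit level, resource files and the deploy mode map to standard options. The invocation below is illustrative only; the file names, class, and jar are placeholders:

```shell
# Illustrative spark-submit invocation; paths, class, and jar are placeholders.
spark-submit \
  --deploy-mode cluster \
  --files lookup.csv,app.properties \
  --class com.example.MyJob \
  my-job.jar
```

Files passed with `--files` are distributed to every worker, where they can be resolved with `SparkFiles.get("lookup.csv")`.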
Multiple improvements have been performed to homogenize the usage of Change Data Capture (CDC) in the various Components.
Parameters have been homogenized, so that all Templates now have the same CDC Parameters with the same feature support.
Multiple fixes have also been performed to correct CDC issues. Refer to the changelog for the exact list of changes.
Two new dedicated Templates have been added to load data from Elasticsearch into Spark, and to load data from Spark into Elasticsearch.
Two new dedicated Templates have been added to load data from Parquet HDFS Files into Spark, and to load data from Spark into HDFS Parquet Files.
Datatype conversion between various systems when working with Spark has been improved to better handle the different datatypes.
An issue with the Kerberos command launched under Windows environments has been fixed.
The "kinit" command launched to initialize Kerberos security was not properly formed on Windows environments.
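For reference, a well-formed kinit invocation with a keytab looks as follows; the principal, realm, and keytab path are placeholders, and on Windows any path containing spaces must be quoted:

```shell
# Illustrative only: principal, realm, and keytab path are placeholders.
kinit -k -t "C:\security\user.keytab" user@EXAMPLE.COM
```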
When loading data from Spark into Hive in SCD mode, there was an issue when changes on the data led to a partition truncation: the Template execution failed when trying to retrieve the partitions to truncate.
This issue has been fixed.
DI-4011: Addition of ability to specify deploy mode (cluster or client)
DI-4012: Addition of ability to work with resource files through a new dedicated node in Metadata and a new parameter on the Spark Submit TOOL
DI-4042: Addition of ability to handle spark session configuration when multiple targets are loaded with Spark
DI-3580: Template - LOAD Hdfs File to Spark - new parameter "In File List"
DI-3581: Template - LOAD Hdfs File to Spark - support compressed files when using fileDriver Read Method
DI-3719: Template - Load Hdfs Json to Spark - new Template to load JSON files stored in HDFS into Spark
DI-3800: TOOL - Spark Execution Unit Launcher - add number of partitions to debug prints
DI-2002: Spark - add support for HTTPS and Kerberized Livy connections
DI-2539: Spark - enhance datatype conversion when reading through JDBC
DI-1912: Templates updated - support having CDC sources on Templates which were not supporting it (such as staging templates)
DI-1909: Templates updated - New Parameters 'Unlock Cdc Table' and 'Lock Cdc Table' to configure the locking behaviour of CDC tables